Smart Mobility Analysis

Focus Group: Transport and Mobility

  • Authored by: Chathu Siriwardena
  • Duration: 10 Weeks
  • Level: Intermediate
  • Pre-requisite Skills: Python, Data Wrangling, Data Visualisation, Data Modeling, Machine Learning, Deep Learning, Geographical Coordinates Handling
Scenario
This use case focuses on integrating pedestrian traffic, tree canopy coverage, and weather conditions to improve urban mobility, promote sustainable commuting, and enhance city planning. By leveraging data-driven insights, the City of Melbourne can design walkable, climate-resilient, and commuter-friendly urban spaces.
User Story¶
  • As a commuter, I want to find shaded and pedestrian-friendly routes so that I can walk comfortably, even during extreme weather.

  • As a city planner, I want to identify heat-prone pedestrian areas so that we can prioritise tree planting and optimise transport connectivity.

  • As a business owner, I want to understand foot traffic patterns near my store so that I can adjust my operations based on customer movement trends.

At the end of this use case you will:¶
  • Learn to import datasets using the Melbourne Open Data API v2.1.
  • Gain proficiency in merging multiple datasets to create a comprehensive view.
  • Learn data visualisation using matplotlib and seaborn.
  • Understand geospatial analysis by working with geolocations, using libraries like Geopy and Folium to map pedestrian routes, transport networks, and tree canopy coverage.
  • Develop a "Cool Routes" scoring model that combines tree canopy data, pedestrian counts, and weather conditions to identify optimal walking paths for heat resilience.
  • Develop regression models and basic feed-forward neural networks (FFNNs) to predict foot traffic demand based on weather conditions and tree canopy availability for sustainable and efficient commuting.
  • Evaluate the impact of the Green Commute on pedestrian satisfaction and heat stress reduction.
Data Sets Used:¶

Data Set 1. Pedestrian Counting System
This data set contains ID, Location ID, Sensing Date, Hour Day, Direction 1, Direction 2, Pedestrian Count, Sensor Name and Location. The data set was used to identify the movements of pedestrians around the city area. The dataset is imported from the Melbourne Open Data website using API v2.1.

Data Set 2. Tree Canopies Data
This data set contains geo_point_2d, geo_shape, objectid, shape_leng and shape_area. It maps tree canopy within the City of Melbourne, captured from 2018 aerial photography and LiDAR. The dataset is imported from the Melbourne Open Data website using API v2.1.

Data Set 3. Bus Stop Data
This data set contains geo_point_2d, geo_shape, prop_id, addresspt1, addressp_1, asset_clas, asset_type, objectid, str_id, addresspt, asset_subt, model_desc, mcc_id, roadseg_id, descriptio and model_no. It shows the locations of the bus stops within the City of Melbourne. The dataset is imported from the Melbourne Open Data website using API v2.1.

Data Set 4. City Circle Tram Stops Data
This data set contains geo_point_2d, geo_shape, name, xorg, stop_no, mccid_str, xsource, xdate and mccid_int. It covers the City Circle tram service stops within the City of Melbourne. The dataset is imported from the Melbourne Open Data website using API v2.1.

Data Set 5. Microclimate Sensors Data
This data set contains device_id, received_at, sensorlocation, latlong, minimumwinddirection, averagewinddirection, maximumwinddirection, minimumwindspeed, averagewindspeed, gustwindspeed, airtemperature, relativehumidity, atmosphericpressure, pm25, pm10 and noise. It contains climate readings from microclimate sensors located within the City. The dataset is imported from the Melbourne Open Data website using API v2.1.

Outline of the Use Case¶
  1. Data Preprocessing

    I started the use case by cleaning and preparing each dataset for analysis. This involves handling missing values and duplicates: removing or imputing missing values in latitude, longitude, and other critical fields.
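The handling described above can be sketched with pandas on a toy frame (illustrative values only, not the real data):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for any of the raw datasets (illustrative values only)
df = pd.DataFrame({
    'latitude':  [-37.8181, np.nan, -37.8129, -37.8181],
    'longitude': [144.9678, 144.9751, np.nan, 144.9678],
    'pedestriancount': [276, 39, 14, 276],
})

# Rows without coordinates cannot be mapped, so drop them
df = df.dropna(subset=['latitude', 'longitude'])

# Exact duplicate observations are removed
df = df.drop_duplicates()

# Non-critical numeric gaps could instead be imputed, e.g. with the median
df['pedestriancount'] = df['pedestriancount'].fillna(df['pedestriancount'].median())
print(len(df))  # 1 row survives: two dropped for missing coordinates, one duplicate removed
```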

  2. Data Visualisation

  • Interactive Maps: Used tools like Folium to create an interactive map showing the distribution of foot traffic, canopy coverage, and public transport stops.
  • Bar charts, stacked bar charts, pie charts, multiple bar charts, and other graphs and tables to surface the key insights.
  3. Feature Engineering

Next, I created features that help the model understand the relationship between pedestrian traffic, tree canopy coverage, and weather conditions:

  • Weather Index Calculation: This index measures how comfortable the weather is for walking based on temperature, wind speed, and humidity.

  • Canopy Coverage Ratio : This is a measure of how much of a pedestrian area is covered by tree canopy. It is calculated as the proportion of pedestrian observations that fall within areas shaded by tree canopies.

  • Stress Index Calculation: This index estimates pedestrian stress, increasing when foot traffic is high and tree canopy coverage is low.

  • Walkability Score Calculation: This score combines pedestrian activity, tree canopy, low stress levels, and favorable weather into a single value to assess how walkable an area is. Higher scores indicate better walking environments.
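As a sketch of how these features might be combined (the weights and exact formulas below are illustrative stand-ins, not the notebook's actual definitions):

```python
import pandas as pd

# Two illustrative observations: a comfortable shaded street vs a hot, busy, bare one
obs = pd.DataFrame({
    'airtemperature':   [23.9, 35.0],
    'averagewindspeed': [3.9, 0.5],
    'relativehumidity': [57.3, 70.0],
    'pedestriancount':  [276, 2167],
    'canopy_ratio':     [0.6, 0.1],   # share of observations under canopy
})

# Weather index: comfort peaks near 22 C, penalised by humidity and wind (assumed weights)
obs['weather_index'] = (
    1 - (obs['airtemperature'] - 22).abs() / 30
) - 0.2 * (obs['relativehumidity'] / 100) - 0.05 * obs['averagewindspeed']

# Stress index: high foot traffic combined with little shade raises stress
peds_norm = obs['pedestriancount'] / obs['pedestriancount'].max()
obs['stress_index'] = peds_norm * (1 - obs['canopy_ratio'])

# Walkability score: activity, shade and good weather help; stress hurts
obs['walkability'] = (
    0.3 * peds_norm + 0.3 * obs['canopy_ratio']
    + 0.2 * obs['weather_index'] - 0.2 * obs['stress_index']
)
```

With these stand-in weights, the shaded, comfortable street scores higher than the hot, crowded one, which is the behaviour the real scoring model is designed to capture.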

  4. Model Selection and Model Building
  • Geospatial Clustering Model (Density-based):
    • DBSCAN: DBSCAN clustering was applied to group street segments that share similar characteristics such as pedestrian activity, weather comfort, canopy coverage, and environmental stress. Clustering transforms raw, multidimensional urban data into actionable intelligence, supporting data-driven decisions for improving pedestrian experiences, reducing heat stress, and enhancing city livability.
    • K-means: Clustered areas to analyse how the walkability score is distributed across clusters.
  • Regression Model:
    • Multiple Linear Regression/GLM: Predicted the walkability score from features such as the weather index and stress index.
    • Logistic Regression: A binary model created to classify whether the walkability score is sufficient or insufficient based on the input features.
  • Random Forests/Gradient Boosting:

For more complex relationships, Random Forest and Gradient Boosting models were used to predict walkability sufficiency from multiple features, including geospatial ones.

  • Deep Learning Approach for Predicting Walkability with Custom Metrics (FFNN)
  5. Model Evaluation Metrics
  • Evaluated the models using the metrics below:

    • Mean Absolute Error (MAE) / Mean Squared Error (MSE): For the regression models predicting the walkability score.
    • Clustering metrics: Evaluated the density-based clusters using silhouette score.
    • Classification metrics: For logistic regression, accuracy, precision, recall, and F1-score were used to assess walkability sufficiency.
  6. Outputs

    Walkability Score and Categorisation: Developed a scoring system that ranks each area within the City of Melbourne by its walkability score.
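The modelling and evaluation pipeline outlined above can be sketched end to end on synthetic data (the cluster centres, the DBSCAN eps value, and the "sufficient walkability" threshold are all invented here purely for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import silhouette_score, accuracy_score, f1_score

rng = np.random.default_rng(42)
# Synthetic street segments: [pedestrian activity, weather index, canopy ratio, stress index]
busy  = rng.normal([0.8, 0.4, 0.2, 0.7], 0.05, size=(60, 4))   # crowded, bare segments
leafy = rng.normal([0.3, 0.7, 0.8, 0.1], 0.05, size=(60, 4))   # quiet, shaded segments
X = StandardScaler().fit_transform(np.vstack([busy, leafy]))

# Density-based clustering of segments with similar characteristics
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
mask = labels != -1                       # ignore noise points when scoring
sil = silhouette_score(X[mask], labels[mask])

# Binary "sufficient walkability" target: a simple shade-minus-stress threshold for illustration
y = (X[:, 2] - X[:, 3] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f'silhouette={sil:.2f} accuracy={accuracy_score(y_te, pred):.2f} '
      f'f1={f1_score(y_te, pred):.2f}')
```

On this cleanly separated toy data both the clustering and the classifier score highly; the real datasets are far noisier, which is why the notebook compares several model families.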

Importing Required Libraries¶

The code below imports the libraries used for data analysis, visualisation, mapping, and interactivity. requests fetches data from APIs, while pandas and numpy support data manipulation and numerical operations. StringIO handles in-memory text, such as loading CSVs from strings. For geolocation tasks, geopy and its Nominatim geocoder convert place names into coordinates; folium creates interactive maps, and ipywidgets together with IPython.display adds interactive elements to the Jupyter Notebook. Visualisation is handled by seaborn and matplotlib.pyplot, with Patch from matplotlib.patches used for custom legends or shapes in plots, and the datetime module supports temporal analysis. geopandas and shapely handle geometric data, scikit-learn provides the clustering, regression, and evaluation tools, and TensorFlow/Keras supports the deep learning models.

In [89]:
import requests
import pandas as pd
import numpy as np
from io import StringIO
import datetime
import re
import json
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from geopy.distance import geodesic
import folium
from folium.plugins import HeatMap
from ipywidgets import interact, widgets
from IPython.display import display, clear_output
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mcolors
from matplotlib.patches import Patch
import geopandas as gpd
from shapely.geometry import shape, Point
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score, precision_score,
                             recall_score, f1_score, roc_curve, auc, silhouette_score,
                             confusion_matrix, ConfusionMatrixDisplay)
from sklearn.ensemble import RandomForestRegressor
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras import backend as K
Loading all Data sets¶
Data Set 1: Pedestrian Counting System.
In [3]:
base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='pedestrian-counting-system-monthly-counts-per-hour'

url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    pedestrian_df=pd.read_csv(StringIO(url_content),delimiter=';')
    print(pedestrian_df.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
             id  location_id sensing_date  hourday  direction_1  direction_2  \
0   72120220515           72   2022-05-15        1          104          172   
1  471720240917           47   2024-09-17       17         1273          894   
2  172320211101           17   2021-11-01       23            8            6   
3  171820230726           17   2023-07-26       18          267          383   
4   24820250405           24   2025-04-05        8          213          218   
5   54320240224           54   2024-02-24        3           13            5   
6   50420250303           50   2025-03-03        4            1            0   
7  143020250508          143   2025-05-08        0           52           19   
8   21720221210            2   2022-12-10       17         1671         1129   
9  391320231018           39   2023-10-18       13          204          203   

   pedestriancount sensor_name                    location  
0              276      ACMI_T  -37.81726338, 144.96872809  
1             2167    Eli250_T   -37.81258467, 144.9625775  
2               14     Col15_T  -37.81362543, 144.97323591  
3              650     Col15_T  -37.81362543, 144.97323591  
4              431    Col620_T  -37.81887963, 144.95449198  
5               18    Swa607_T    -37.804024, 144.96308399  
6                1    Lyg309_T  -37.79808192, 144.96721013  
7               71   Spencer_T    -37.821728, 144.95557015  
8             2800    Bou283_T  -37.81380668, 144.96516718  
9              407     AlfPl_T  -37.81379749, 144.96995745  
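The same export pattern is repeated for each of the five datasets below; it could be factored into a small helper like this sketch (the function names `build_export_url` and `fetch_dataset` are my own, not part of the API):

```python
import requests
import pandas as pd
from io import StringIO

API_BASE = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'

def build_export_url(dataset_id, base_url=API_BASE):
    """Build the CSV export URL for an Open Data Explore API v2.1 dataset."""
    return f'{base_url}{dataset_id}/exports/csv'

def fetch_dataset(dataset_id):
    """Download a dataset as a DataFrame, or return None on failure."""
    params = {'select': '*', 'limit': -1, 'lang': 'en', 'timezone': 'UTC'}
    response = requests.get(build_export_url(dataset_id), params=params)
    if response.status_code != 200:
        print(f'Request failed with status code {response.status_code}')
        return None
    # Exports from this API are semicolon-delimited CSV
    return pd.read_csv(StringIO(response.content.decode('utf-8')), delimiter=';')
```

For example, `pedestrian_df = fetch_dataset('pedestrian-counting-system-monthly-counts-per-hour')` would replace the first cell above.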

Data Set 2: Tree Canopies Data.

In [4]:
base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='tree-canopies-public-realm-2018-urban-forest'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    tree_canopies_df=pd.read_csv(StringIO(url_content),delimiter=';')
    print(tree_canopies_df.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
                              geo_point_2d  \
0   -37.81304517121492, 144.98612858745977   
1  -37.813031352270215, 144.98264073647684   
2   -37.81261020314892, 144.96112288812233   
3   -37.81219284514014, 144.93846977801448   
4   -37.81239953857732, 144.95122560445583   
5  -37.813040580695024, 144.98654806873841   
6    -37.81231922742188, 144.9447777601162   
7   -37.81218994603368, 144.94262980622725   
8   -37.81245033141797, 144.98815520131134   
9    -37.81244314561024, 144.9495590639178   

                                           geo_shape  objectid  shape_leng  \
0  {"coordinates": [[[[144.98613240697972, -37.81...     10373    2.692370   
1  {"coordinates": [[[[144.98267255431483, -37.81...     10379   55.155123   
2  {"coordinates": [[[[144.96112403835852, -37.81...     10380    6.279844   
3  {"coordinates": [[[[144.93847665550007, -37.81...     10399    7.048844   
4  {"coordinates": [[[[144.95122528646937, -37.81...     10400    2.794252   
5  {"coordinates": [[[[144.98655398642268, -37.81...     10385    4.334477   
6  {"coordinates": [[[[144.94478614151308, -37.81...     10387    8.128402   
7  {"coordinates": [[[[144.9426325334389, -37.812...     10438    7.923251   
8  {"coordinates": [[[[144.98817843816417, -37.81...     10669   10.680974   
9  {"coordinates": [[[[144.94963279484296, -37.81...     10393   60.743893   

   shape_area  
0    0.488406  
1  125.461002  
2    2.816221  
3    3.643475  
4    0.612298  
5    1.348686  
6    4.911725  
7    4.257095  
8    4.845412  
9  165.844186  

Data Set 3: Bus Stops Data.

In [5]:
base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='bus-stops'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    bus_stops_df=pd.read_csv(StringIO(url_content),delimiter=';')
    print(bus_stops_df.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
                              geo_point_2d  \
0   -37.80384165792465, 144.93239283833262   
1    -37.81548699581418, 144.9581794249902   
2   -37.81353897396532, 144.95728334230756   
3   -37.82191394843844, 144.95539345270072   
4   -37.83316401267591, 144.97443745130263   
5   -37.79436108568101, 144.92998424529242   
6  -37.817452093555325, 144.96168480565794   
7    -37.82146476463953, 144.9303191551562   
8  -37.837547087144706, 144.98191138368836   
9  -37.812490976626215, 144.95370614040704   

                                           geo_shape  prop_id  addresspt1  \
0  {"coordinates": [144.93239283833262, -37.80384...        0   76.819824   
1  {"coordinates": [144.9581794249902, -37.815486...        0   21.561304   
2  {"coordinates": [144.95728334230756, -37.81353...        0   42.177187   
3  {"coordinates": [144.95539345270072, -37.82191...        0   15.860434   
4  {"coordinates": [144.97443745130263, -37.83316...        0    0.000000   
5  {"coordinates": [144.92998424529242, -37.79436...        0    3.105722   
6  {"coordinates": [144.96168480565794, -37.81745...        0    7.239726   
7  {"coordinates": [144.9303191551562, -37.821464...        0   32.180664   
8  {"coordinates": [144.98191138368836, -37.83754...        0   41.441167   
9  {"coordinates": [144.95370614040704, -37.81249...        0   16.143764   

   addressp_1 asset_clas               asset_type  objectid   str_id  \
0         357    Signage  Sign - Public Transport       355  1235255   
1          83    Signage  Sign - Public Transport       600  1231226   
2         207    Signage  Sign - Public Transport       640  1237092   
3         181    Signage  Sign - Public Transport       918  1232777   
4           0    Signage  Sign - Public Transport      1029  1271914   
5         112    Signage  Sign - Public Transport      1139  1577059   
6         268    Signage  Sign - Public Transport      1263  1481028   
7         298    Signage  Sign - Public Transport      2527  1245221   
8          78    Signage  Sign - Public Transport      2922  1248743   
9          99    Signage  Sign - Public Transport      5111  1253565   

   addresspt  asset_subt                       model_desc   mcc_id  \
0     570648         NaN  Sign - Public Transport 1 Panel  1235255   
1     548056         NaN  Sign - Public Transport 1 Panel  1231226   
2     543382         NaN  Sign - Public Transport 1 Panel  1237092   
3     103975         NaN  Sign - Public Transport 1 Panel  1232777   
4          0         NaN  Sign - Public Transport 1 Panel  1271914   
5     616011         NaN  Sign - Public Transport 1 Panel  1577059   
6     527371         NaN  Sign - Public Transport 1 Panel  1481028   
7     110521         NaN  Sign - Public Transport 1 Panel  1245221   
8     107419         NaN  Sign - Public Transport 1 Panel  1248743   
9     602160         NaN  Sign - Public Transport 1 Panel  1253565   

   roadseg_id                                        descriptio model_no  
0       21673  Sign - Public Transport 1 Panel Bus Stop Type 13     P.16  
1       20184   Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
2       20186   Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
3       22174   Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
4       22708   Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
5       21693   Sign - Public Transport 1 Panel Bus Stop Type 1     P.16  
6       20171   Sign - Public Transport 1 Panel Bus Stop Type 3     P.16  
7       30638   Sign - Public Transport 1 Panel Bus Stop Type 3     P.16  
8       22245   Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
9       20030   Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  

Data Set 4: City Circle Tram Stops Data.

In [6]:
base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='city-circle-tram-stops'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    tram_stops_df=pd.read_csv(StringIO(url_content),delimiter=';')
    print(tram_stops_df.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
                              geo_point_2d  \
0   -37.82023778673241, 144.95786314283018   
1   -37.82097269970027, 144.95546153614245   
2   -37.82190465062153, 144.95109855638137   
3  -37.811771476718356, 144.95644059700524   
4   -37.81105928060848, 144.95891745116262   
5   -37.80961884837298, 144.96384957029932   
6  -37.808876998255194, 144.96634474519394   
7   -37.81358116790275, 144.97406360491075   
8    -37.8176316450406, 144.96690455927876   
9    -37.818324403770184, 144.964479208357   

                                           geo_shape  \
0  {"coordinates": [144.95786314283018, -37.82023...   
1  {"coordinates": [144.95546153614245, -37.82097...   
2  {"coordinates": [144.95109855638137, -37.82190...   
3  {"coordinates": [144.95644059700524, -37.81177...   
4  {"coordinates": [144.95891745116262, -37.81105...   
5  {"coordinates": [144.96384957029932, -37.80961...   
6  {"coordinates": [144.96634474519394, -37.80887...   
7  {"coordinates": [144.97406360491075, -37.81358...   
8  {"coordinates": [144.96690455927876, -37.81763...   
9  {"coordinates": [144.964479208357, -37.8183244...   

                                   name      xorg stop_no  mccid_str  xsource  \
0  Melbourne Aquarium / Flinders Street  GIS Team       2        NaN  Mapbase   
1      Spencer Street / Flinders Street  GIS Team       1        NaN  Mapbase   
2       The Goods Shed / Wurundjeri Way  GIS Team      D5        NaN  Mapbase   
3      William Street / La Trobe Street  GIS Team       3        NaN  Mapbase   
4        Queen Street / La Trobe Street  GIS Team       4        NaN  Mapbase   
5     Swanston Street / La Trobe Street  GIS Team       6        NaN  Mapbase   
6      Russell Street / La Trobe Street  GIS Team       7        NaN  Mapbase   
7           Parliament / Collins Street  GIS Team       8        NaN  Mapbase   
8     Swanston Street / Flinders Street  GIS Team       5        NaN  Mapbase   
9    Elizabeth Street / Flinders Street  GIS Team       4        NaN  Mapbase   

        xdate  mccid_int  
0  2011-10-18          4  
1  2011-10-18          5  
2  2011-10-18          7  
3  2011-10-18         16  
4  2011-10-18         17  
5  2011-10-18         19  
6  2011-10-18         20  
7  2011-10-18         25  
8  2011-10-18          1  
9  2011-10-18          2  

Data Set 5: Micro Climate Data.

In [7]:
base_url='https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id='microclimate-sensors-data'


url=f'{base_url}{dataset_id}/exports/csv'
params={'select':'*','limit':-1,'lang':'en','timezone':'UTC'}

response=requests.get(url,params=params)

if response.status_code==200:
    url_content=response.content.decode('utf-8')
    climate_df=pd.read_csv(StringIO(url_content),delimiter=';')
    print(climate_df.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
            device_id                received_at  \
0  ICTMicroclimate-08  2025-02-09T00:54:37+00:00   
1  ICTMicroclimate-11  2025-02-09T01:02:11+00:00   
2  ICTMicroclimate-05  2025-02-09T01:03:24+00:00   
3  ICTMicroclimate-01  2025-02-09T01:02:43+00:00   
4  ICTMicroclimate-09  2025-02-09T01:17:37+00:00   
5  ICTMicroclimate-05  2025-02-09T01:18:26+00:00   
6  ICTMicroclimate-02  2025-02-09T01:26:51+00:00   
7  ICTMicroclimate-07  2025-02-09T01:35:39+00:00   
8  ICTMicroclimate-01  2025-02-09T01:32:44+00:00   
9  ICTMicroclimate-04  2025-02-09T01:38:22+00:00   

                                      sensorlocation  \
0  Swanston St - Tram Stop 13 adjacent Federation...   
1                                   1 Treasury Place   
2                 Enterprize Park - Pole ID: COM1667   
3                    Birrarung Marr Park - Pole 1131   
4  SkyFarm (Jeff's Shed). Rooftop - Melbourne Con...   
5                 Enterprize Park - Pole ID: COM1667   
6                         101 Collins St L11 Rooftop   
7  Tram Stop 7C - Melbourne Tennis Centre Precinc...   
8                    Birrarung Marr Park - Pole 1131   
9                                        Batman Park   

                    latlong  minimumwinddirection  averagewinddirection  \
0  -37.8184515, 144.9678474                   0.0                 153.0   
1   -37.812888, 144.9750857                   0.0                 144.0   
2  -37.8204083, 144.9591192                   0.0                  45.0   
3  -37.8185931, 144.9716404                   NaN                 150.0   
4  -37.8223306, 144.9521696                   0.0                 241.0   
5  -37.8204083, 144.9591192                   0.0                 357.0   
6   -37.814604, 144.9702991                   0.0                 357.0   
7  -37.8222341, 144.9829409                   0.0                  91.0   
8  -37.8185931, 144.9716404                   NaN                 143.0   
9  -37.8221828, 144.9562225                   0.0                  10.0   

   maximumwinddirection  minimumwindspeed  averagewindspeed  gustwindspeed  \
0                 358.0               0.0               3.9            7.9   
1                 356.0               0.0               2.0            7.8   
2                 133.0               0.0               1.5            2.7   
3                   NaN               NaN               1.6            NaN   
4                 359.0               0.0               0.9            4.4   
5                  32.0               1.6               1.9            2.2   
6                 359.0               0.0               0.5            1.4   
7                 356.0               0.0               0.9            4.4   
8                   NaN               NaN               1.9            NaN   
9                 356.0               0.0               1.5            6.9   

   airtemperature  relativehumidity  atmosphericpressure  pm25  pm10  \
0            23.9         57.300000               1009.7   0.0   0.0   
1            24.5         56.200000               1005.3   0.0   0.0   
2            25.0         60.000000               1009.6   1.0   3.0   
3            23.1         61.099998               1009.0   0.0   5.0   
4            25.6         53.700000               1007.9   0.0   0.0   
5            24.5         58.700000               1009.3   1.0   3.0   
6            26.6         51.800000               1004.7   1.0   3.0   
7            26.6         49.200000               1011.3   0.0   0.0   
8            23.9         59.599998               1008.5   0.0   5.0   
9            26.5         51.800000               1011.9   0.0   0.0   

       noise  
0  80.500000  
1  62.900000  
2  68.500000  
3  51.700001  
4  60.200000  
5  68.700000  
6  69.200000  
7  64.500000  
8  53.200001  
9  72.300000  

Pedestrian Counting System Data.¶

I performed several data cleaning steps:

  • Created 'latitude' and 'longitude' columns from the 'location' column.
  • Dropped the 'id', 'location_id', 'direction_1', 'direction_2' and 'location' columns and renamed 'sensing_date' to 'Date'.
In [11]:
pedestrian_df.head(10)
Out[11]:
id location_id sensing_date hourday direction_1 direction_2 pedestriancount sensor_name location
0 6220230214 6 2023-02-14 2 13 21 34 FliS_T -37.81911705, 144.96558255
1 85720220127 85 2022-01-27 7 25 14 39 488Mac_T -37.79432415, 144.92973378
2 8020250505 8 2025-05-05 0 2 0 2 WebBN_T -37.82293543, 144.9471751
3 5420240510 5 2024-05-10 4 6 1 7 PriNW_T -37.81874249, 144.96787656
4 25620240110 25 2024-01-10 6 91 106 197 MCEC_T -37.82401776, 144.95604426
5 671520220214 67 2022-02-14 15 198 224 422 FLDegS_T -37.81688755, 144.96562569
6 1611720241226 161 2024-12-26 17 54 868 922 BirArt1109_T -37.81851276, 144.97131336
7 18020231016 18 2023-10-16 0 0 2 2 Col12_T -37.81344862, 144.97305353
8 871120230710 87 2023-07-10 11 67 67 134 Errol23_T -37.80454949, 144.94921863
9 282320221012 28 2022-10-12 23 42 36 78 VAC_T -37.82129925, 144.96879309

The above output shows the first 10 rows of the pedestrian sensor dataset. Each row represents hourly pedestrian counts recorded by a specific sensor at a particular location and time. This provides an initial view of the structure and granularity of the dataset, confirming that it contains both spatial and temporal information, which is essential for further analysis and visualisation.

In [151]:
pedestrian_df.shape
Out[151]:
(2305280, 9)
In [12]:
pedestrian_df.nunique()
Out[12]:
id                 2311436
location_id             98
sensing_date          1413
hourday                 24
direction_1           3179
direction_2           3269
pedestriancount       5052
sensor_name             96
location                98
dtype: int64

There are 2,305,280 records and 9 variables (or features), with each row representing a unique observation of pedestrian counts at a specific time and location.

In [8]:
# Split 'location' into 'latitude' and 'longitude'
pedestrian_df[['latitude', 'longitude']] = pedestrian_df['location'].str.split(', ', expand=True)

# Drop 'id', 'location_id', 'direction-1','direction-2' , 'location'and 'sensor_name' columns
pedestrian_df = pedestrian_df.drop(columns=['id', 'location_id', 'direction_1','direction_2', 'location'])

# Rename 'Sensing_date' to 'Date'
pedestrian_df = pedestrian_df.rename(columns={'sensing_date': 'Date'})

pedestrian_df.head(10)
Out[8]:
Date hourday pedestriancount sensor_name latitude longitude
0 2022-05-15 1 276 ACMI_T -37.81726338 144.96872809
1 2024-09-17 17 2167 Eli250_T -37.81258467 144.9625775
2 2021-11-01 23 14 Col15_T -37.81362543 144.97323591
3 2023-07-26 18 650 Col15_T -37.81362543 144.97323591
4 2025-04-05 8 431 Col620_T -37.81887963 144.95449198
5 2024-02-24 3 18 Swa607_T -37.804024 144.96308399
6 2025-03-03 4 1 Lyg309_T -37.79808192 144.96721013
7 2025-05-08 0 71 Spencer_T -37.821728 144.95557015
8 2022-12-10 17 2800 Bou283_T -37.81380668 144.96516718
9 2023-10-18 13 407 AlfPl_T -37.81379749 144.96995745

The location column contains both latitude and longitude as a single string, so it was split into separate latitude and longitude columns, which makes it easier to work with coordinates for mapping and spatial analysis. The other steps clean and streamline the dataset by removing redundant information, improving clarity, and preparing it for visualisation, filtering, or analysis focused on time- and location-based pedestrian patterns.

Adding Street Names to the Pedestrian Data set

In [9]:
# Loading Street Names 
data = """
latitude	longitude	Street
-37.811	144.9643	Swanston St
-37.8213	144.9688	St Kilda Rd
-37.8169	144.9656	Flinders Ln
-37.8112	144.9666	Lonsdale St
-37.8127	144.9539	King St
-37.8146	144.9429	Docklands
-37.8189	144.9545	Collins St
-37.8124	144.9655	Swanston St
-37.8191	144.9656	Flinders Walk
-37.8133	144.9668	Bourke St
-37.8198	144.951	Collins St
-37.8083	144.963	A Beckett St
-37.82	144.9687	St Kilda Rd
-37.8165	144.9612	Queen St
-37.8063	144.9587	Victoria St
-37.8163	144.9709	Flinders St
-37.8141	144.9661	Swanston St
-37.8188	144.9471	Bourke St
-37.8169	144.9656	Flinders Ln
-37.8187	144.9679	Federation Square
-37.8031	144.9491	Queensberry St
-37.8196	144.9633	Flinders Walk
-37.82	144.9598	St Kilda Rd
-37.8077	144.9631	Swanston St
-37.8127	144.9679	King St
-37.7945	144.9304	Macaulay Rd
-37.8169	144.9536	Flinders Ln
-37.818	144.965	Flinders St
-37.8134	144.9731	Collins St
-37.7981	144.9672	Lygon St
-37.8156	144.9397	Docklands
-37.8046	144.9495	Errol St
-37.8179	144.9662	Flinders St
-37.8173	144.9687	Russell St
-37.813	144.9516	La Trobe St
-37.8205	144.9413	Bourke St
-37.81	144.9622	La Trobe St
-37.8144	144.9443	Harbour Esplanade
-37.8061	144.9564	Victoria St
-37.8126	144.9626	Elizabeth St
-37.8136	144.9732	Spring Street
-37.8017	144.9666	Lygon St
-37.7984	144.9641	Monash Rd
-37.8229	144.9472	Navigation Dr
-37.8217	144.9556	Rebecca Walk
-37.8124	144.9714	Swanston St
-37.8045	144.9492	Errol St
-37.8153	144.9523	Lonsdale St
-37.8201	144.9576	King St
-37.8001	144.9639	Swanston St
-37.81	144.9723	La Trobe St
-37.8197	144.968	Arts Centre Melbourne
-37.804	144.9631	Swanston St
-37.8106	144.9644	Little Lonsdale St
-37.8168	144.9656	Flinders Ln
-37.8157	144.9668	Swanston St
-37.8024	144.9616	Pelham St
-37.8202	144.9651	Southgate Ave
-37.8173	144.9532	Federation Square
-37.8125	144.9619	Lonsdale St
-37.8152	144.9747	Flinders St
-37.8167	144.9669	Swanston St
-37.8084	144.9591	Franklin St
-37.8011	144.967	Lygon St
-37.824	144.956	Convention Centre Pl
-37.813	144.9516	La Trobe St
-37.8138	144.9652	Bourke St
-37.8147	144.9447	Docklands
-37.8123	144.9615	Lonsdale St
-37.8169	144.953	Flinders Ln
-37.8156	144.9655	Docklands
-37.8073	144.9596	Elizabeth St
-37.8201	144.9629	King St
-37.813	144.9568	La Trobe St
-37.8189	144.9461	Collins St
-37.8135	144.9652	Bourke St
-37.8117	144.9682	Little Bourke St
-37.8184	144.9736	Batman Ave
-37.8149	144.9661	Swanston St
-37.8119	144.9562	Flagstaff Station
-37.8138	144.97	Bourke St
-37.8259	144.9619	Balston St
-37.8163	144.9709	Flinders St
-37.8074	144.9599	Elizabeth St
-37.8191	144.9545	Flinders Walk
-37.809	144.9493	Spencer St
-37.795	144.9353	Macaulay Rd
-37.8185	144.9713	Princes Walk
-37.8125	144.9569	Lonsdale St
-37.7943	144.9297	Macaulay Rd
-37.797	144.9644	Elgin St
-37.8239	144.963	Power St
-37.8175	144.9733	Batman Ave
-37.82	144.9583	St Kilda Rd
-37.8163	144.9555	Flinders St
-37.8176	144.9733	Batman Ave
-37.8095	144.9494	State Route
-37.8101	144.9614	La Trobe St
"""

# Load into DataFrame
street_df = pd.read_csv(StringIO(data), sep="\t")

# Displaying Data Frame
street_df.head()
Out[9]:
latitude longitude Street
0 -37.8110 144.9643 Swanston St
1 -37.8213 144.9688 St Kilda Rd
2 -37.8169 144.9656 Flinders Ln
3 -37.8112 144.9666 Lonsdale St
4 -37.8127 144.9539 King St

Street names from the street-names data set were mapped onto the pedestrian data set by matching rounded coordinates.

In [10]:
# Change latitude and longitude to numeric values
pedestrian_df['latitude'] = pd.to_numeric(pedestrian_df['latitude'], errors='coerce')
pedestrian_df['longitude'] = pd.to_numeric(pedestrian_df['longitude'], errors='coerce')
street_df['latitude'] = pd.to_numeric(street_df['latitude'], errors='coerce')
street_df['longitude'] = pd.to_numeric(street_df['longitude'], errors='coerce')

# Dropping rows with missing coordinate values
pedestrian_df = pedestrian_df.dropna(subset=['latitude', 'longitude'])
street_df = street_df.dropna(subset=['latitude', 'longitude'])

# Rounding off coordinates
pedestrian_df['lat_round'] = pedestrian_df['latitude'].round(3)
pedestrian_df['lon_round'] = pedestrian_df['longitude'].round(3)
street_df['lat_round'] = street_df['latitude'].round(3)
street_df['lon_round'] = street_df['longitude'].round(3)

# Merge pedestrian data set with rounded coordinates
pedestrian_df_N = pd.merge(
    pedestrian_df,
    street_df[['lat_round', 'lon_round', 'Street']],
    on=['lat_round', 'lon_round'],
    how='left'
)
pedestrian_df_N = pedestrian_df_N.drop(columns=['lat_round', 'lon_round'])

# Displaying result
pedestrian_df_N.head()
Out[10]:
Date hourday pedestriancount sensor_name latitude longitude Street
0 2022-05-15 1 276 ACMI_T -37.817263 144.968728 Russell St
1 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St
2 2021-11-01 23 14 Col15_T -37.813625 144.973236 Spring Street
3 2023-07-26 18 650 Col15_T -37.813625 144.973236 Spring Street
4 2025-04-05 8 431 Col620_T -37.818880 144.954492 Collins St

The street names are now mapped onto the pedestrian data set.
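A rounded-coordinate join can silently leave sensors without a street name. The sketch below, using toy stand-ins for `pedestrian_df` and `street_df` (the real column names, made-up values), shows how pandas' merge indicator flags unmatched rows:

```python
import pandas as pd

# Toy stand-ins for pedestrian_df and street_df (real column names, made-up values)
pedestrian = pd.DataFrame({
    "latitude": [-37.8173, -37.8126, -37.9999],
    "longitude": [144.9687, 144.9626, 144.0001],
})
streets = pd.DataFrame({
    "latitude": [-37.8173, -37.8126],
    "longitude": [144.9687, 144.9626],
    "Street": ["Russell St", "Elizabeth St"],
})

# Round to 3 decimals (~110 m), as in the merge above
for df in (pedestrian, streets):
    df["lat_round"] = df["latitude"].round(3)
    df["lon_round"] = df["longitude"].round(3)

# indicator=True adds a '_merge' column flagging rows without a street match
merged = pd.merge(
    pedestrian,
    streets[["lat_round", "lon_round", "Street"]],
    on=["lat_round", "lon_round"],
    how="left",
    indicator=True,
)
unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched)  # 1 sensor row has no street name
```

Counting `left_only` rows after the join is a quick sanity check that the chosen rounding precision actually matches the sensors to streets.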

Tree Canopies Data.¶

I performed the following data cleaning steps:

  • Created 'latitude' and 'longitude' columns from the 'geo_point_2d' column.
  • Dropped the 'objectid' and 'geo_point_2d' columns.
In [17]:
tree_canopies_df.head(10)
Out[17]:
geo_point_2d geo_shape objectid shape_leng shape_area
0 -37.81304517121492, 144.98612858745977 {"coordinates": [[[[144.98613240697972, -37.81... 10373 2.692370 0.488406
1 -37.813031352270215, 144.98264073647684 {"coordinates": [[[[144.98267255431483, -37.81... 10379 55.155123 125.461002
2 -37.81261020314892, 144.96112288812233 {"coordinates": [[[[144.96112403835852, -37.81... 10380 6.279844 2.816221
3 -37.81219284514014, 144.93846977801448 {"coordinates": [[[[144.93847665550007, -37.81... 10399 7.048844 3.643475
4 -37.81239953857732, 144.95122560445583 {"coordinates": [[[[144.95122528646937, -37.81... 10400 2.794252 0.612298
5 -37.813040580695024, 144.98654806873841 {"coordinates": [[[[144.98655398642268, -37.81... 10385 4.334477 1.348686
6 -37.81231922742188, 144.9447777601162 {"coordinates": [[[[144.94478614151308, -37.81... 10387 8.128402 4.911725
7 -37.81218994603368, 144.94262980622725 {"coordinates": [[[[144.9426325334389, -37.812... 10438 7.923251 4.257095
8 -37.81245033141797, 144.98815520131134 {"coordinates": [[[[144.98817843816417, -37.81... 10669 10.680974 4.845412
9 -37.81244314561024, 144.9495590639178 {"coordinates": [[[[144.94963279484296, -37.81... 10393 60.743893 165.844186

The above table displays the first 10 entries of the tree canopy dataset, which provides spatial data on tree canopy coverage in Melbourne. Each row corresponds to a specific tree canopy polygon. This dataset is essential for spatial analysis and visualisation of tree cover in relation to other urban features like pedestrian movement, public transport access, or heat mapping.

In [156]:
tree_canopies_df.shape
Out[156]:
(32787, 5)

The dataset contains 32,787 individual tree canopy records, each representing a unique canopy area in the city, with five attributes per record.

In [18]:
tree_canopies_df.nunique()
Out[18]:
geo_point_2d    32787
geo_shape       32785
objectid        32787
shape_leng      32737
shape_area      32740
dtype: int64
In [11]:
# Split 'geo_point_2d' into 'latitude' and 'longitude'
tree_canopies_df[['latitude', 'longitude']] = tree_canopies_df['geo_point_2d'].str.split(', ', expand=True)

# Drop 'objectid' and 'geo_point_2d' columns
tree_canopies_df = tree_canopies_df.drop(columns=['objectid', 'geo_point_2d'])

tree_canopies_df.head(10)
Out[11]:
geo_shape shape_leng shape_area latitude longitude
0 {"coordinates": [[[[144.98613240697972, -37.81... 2.692370 0.488406 -37.81304517121492 144.98612858745977
1 {"coordinates": [[[[144.98267255431483, -37.81... 55.155123 125.461002 -37.813031352270215 144.98264073647684
2 {"coordinates": [[[[144.96112403835852, -37.81... 6.279844 2.816221 -37.81261020314892 144.96112288812233
3 {"coordinates": [[[[144.93847665550007, -37.81... 7.048844 3.643475 -37.81219284514014 144.93846977801448
4 {"coordinates": [[[[144.95122528646937, -37.81... 2.794252 0.612298 -37.81239953857732 144.95122560445583
5 {"coordinates": [[[[144.98655398642268, -37.81... 4.334477 1.348686 -37.813040580695024 144.98654806873841
6 {"coordinates": [[[[144.94478614151308, -37.81... 8.128402 4.911725 -37.81231922742188 144.9447777601162
7 {"coordinates": [[[[144.9426325334389, -37.812... 7.923251 4.257095 -37.81218994603368 144.94262980622725
8 {"coordinates": [[[[144.98817843816417, -37.81... 10.680974 4.845412 -37.81245033141797 144.98815520131134
9 {"coordinates": [[[[144.94963279484296, -37.81... 60.743893 165.844186 -37.81244314561024 144.9495590639178

The geo_point_2d column stores each location as a single string ("latitude, longitude"). Splitting it into separate latitude and longitude columns makes mapping and spatial joins more convenient.

The objectid and geo_point_2d columns are removed to clean the dataset.

These steps streamline the dataset down to the essential information, such as canopy area, shape length, and geographic coordinates, making it easier to work with for visualisation and spatial analysis.

Bus Stops Data.¶

I performed the following data cleaning steps:

  • Created 'latitude' and 'longitude' columns from the 'geo_point_2d' column.
  • Dropped the 'geo_shape', 'prop_id', 'geo_point_2d', 'addresspt1', 'addressp_1', 'asset_clas', 'asset_type', 'objectid', 'str_id', 'addresspt', 'asset_subt', 'model_desc', 'mcc_id', 'roadseg_id', 'descriptio', and 'model_no' columns.
  • Added a 'stop_type' column.
  • Removed duplicates.
In [20]:
bus_stops_df.head(10)
Out[20]:
geo_point_2d geo_shape prop_id addresspt1 addressp_1 asset_clas asset_type objectid str_id addresspt asset_subt model_desc mcc_id roadseg_id descriptio model_no
0 -37.80384165792465, 144.93239283833262 {"coordinates": [144.93239283833262, -37.80384... 0 76.819824 357 Signage Sign - Public Transport 355 1235255 570648 NaN Sign - Public Transport 1 Panel 1235255 21673 Sign - Public Transport 1 Panel Bus Stop Type 13 P.16
1 -37.81548699581418, 144.9581794249902 {"coordinates": [144.9581794249902, -37.815486... 0 21.561304 83 Signage Sign - Public Transport 600 1231226 548056 NaN Sign - Public Transport 1 Panel 1231226 20184 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
2 -37.81353897396532, 144.95728334230756 {"coordinates": [144.95728334230756, -37.81353... 0 42.177187 207 Signage Sign - Public Transport 640 1237092 543382 NaN Sign - Public Transport 1 Panel 1237092 20186 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
3 -37.82191394843844, 144.95539345270072 {"coordinates": [144.95539345270072, -37.82191... 0 15.860434 181 Signage Sign - Public Transport 918 1232777 103975 NaN Sign - Public Transport 1 Panel 1232777 22174 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
4 -37.83316401267591, 144.97443745130263 {"coordinates": [144.97443745130263, -37.83316... 0 0.000000 0 Signage Sign - Public Transport 1029 1271914 0 NaN Sign - Public Transport 1 Panel 1271914 22708 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
5 -37.79436108568101, 144.92998424529242 {"coordinates": [144.92998424529242, -37.79436... 0 3.105722 112 Signage Sign - Public Transport 1139 1577059 616011 NaN Sign - Public Transport 1 Panel 1577059 21693 Sign - Public Transport 1 Panel Bus Stop Type 1 P.16
6 -37.817452093555325, 144.96168480565794 {"coordinates": [144.96168480565794, -37.81745... 0 7.239726 268 Signage Sign - Public Transport 1263 1481028 527371 NaN Sign - Public Transport 1 Panel 1481028 20171 Sign - Public Transport 1 Panel Bus Stop Type 3 P.16
7 -37.82146476463953, 144.9303191551562 {"coordinates": [144.9303191551562, -37.821464... 0 32.180664 298 Signage Sign - Public Transport 2527 1245221 110521 NaN Sign - Public Transport 1 Panel 1245221 30638 Sign - Public Transport 1 Panel Bus Stop Type 3 P.16
8 -37.837547087144706, 144.98191138368836 {"coordinates": [144.98191138368836, -37.83754... 0 41.441167 78 Signage Sign - Public Transport 2922 1248743 107419 NaN Sign - Public Transport 1 Panel 1248743 22245 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
9 -37.812490976626215, 144.95370614040704 {"coordinates": [144.95370614040704, -37.81249... 0 16.143764 99 Signage Sign - Public Transport 5111 1253565 602160 NaN Sign - Public Transport 1 Panel 1253565 20030 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16

The table above shows the first 10 rows of the bus stops dataset, which includes geographic and descriptive details about public transport signage (bus stops) in the area. Each row represents one bus stop sign.

In [21]:
bus_stops_df.shape
Out[21]:
(309, 16)

The bus_stops_df dataset contains 309 rows and 16 columns, representing detailed information for 309 individual bus stop assets.

In [22]:
bus_stops_df.nunique()
Out[22]:
geo_point_2d    295
geo_shape       295
prop_id           6
addresspt1      274
addressp_1      193
asset_clas        1
asset_type        1
objectid        309
str_id          309
addresspt       241
asset_subt        0
model_desc        1
mcc_id          309
roadseg_id      198
descriptio        8
model_no          1
dtype: int64
In [12]:
# Split 'geo_point_2d' into 'latitude' and 'longitude'
bus_stops_df[['latitude', 'longitude']] = bus_stops_df['geo_point_2d'].str.split(', ', expand=True)

# Drop 'geo_shape', 'prop_id', 'geo_point_2d', 'addresspt1', 'addressp_1', 'asset_clas', 'asset_type', 'objectid', 'str_id', 'addresspt', 'asset_subt', 'model_desc', 'mcc_id', 'roadseg_id', 'descriptio', and 'model_no' columns
bus_stops_df = bus_stops_df.drop(columns=['geo_shape', 'geo_point_2d', 'prop_id', 'addresspt1', 'addressp_1', 'asset_clas', 'asset_type', 'objectid', 'str_id', 'addresspt', 'asset_subt', 'model_desc', 'mcc_id', 'roadseg_id', 'descriptio', 'model_no'])

# Add a 'stop_type' column
bus_stops_df['stop_type'] = 'Bus Stop'

# Remove duplicate coordinates
bus_stops_df = bus_stops_df.drop_duplicates(subset=['latitude', 'longitude'])

bus_stops_df.head(10)
Out[12]:
latitude longitude stop_type
0 -37.80384165792465 144.93239283833262 Bus Stop
1 -37.81548699581418 144.9581794249902 Bus Stop
2 -37.81353897396532 144.95728334230756 Bus Stop
3 -37.82191394843844 144.95539345270072 Bus Stop
4 -37.83316401267591 144.97443745130263 Bus Stop
5 -37.79436108568101 144.92998424529242 Bus Stop
6 -37.817452093555325 144.96168480565794 Bus Stop
7 -37.82146476463953 144.9303191551562 Bus Stop
8 -37.837547087144706 144.98191138368836 Bus Stop
9 -37.812490976626215 144.95370614040704 Bus Stop

I split the geo_point_2d column into two separate columns, latitude and longitude, to make the geographic data easier to work with, and removed columns that are irrelevant or redundant for the analysis, keeping only latitude, longitude, and stop_type. A new column, stop_type, set to 'Bus Stop' for all rows, records the asset type. Finally, I removed duplicate bus stops sharing the same coordinates, so each bus stop is unique by location.
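One caveat: deduplicating on string coordinates removes only byte-identical values, so two readings of the same stop that differ past the fifth decimal (roughly a metre) both survive. A hedged sketch with hypothetical coordinates, rounding floats before `drop_duplicates`:

```python
import pandas as pd

# Hypothetical stops: the first two differ only past the fifth decimal (~1 m apart)
stops = pd.DataFrame({
    "latitude": ["-37.80384165", "-37.80384170", "-37.81548699"],
    "longitude": ["144.93239283", "144.93239281", "144.95817942"],
    "stop_type": ["Bus Stop"] * 3,
})

# String equality keeps all three rows; rounding floats merges the near-duplicates
stops["latitude"] = stops["latitude"].astype(float)
stops["longitude"] = stops["longitude"].astype(float)
stops["lat_round"] = stops["latitude"].round(5)
stops["lon_round"] = stops["longitude"].round(5)
deduped = stops.drop_duplicates(subset=["lat_round", "lon_round"]) \
               .drop(columns=["lat_round", "lon_round"])
print(len(deduped))  # 2
```

Whether such near-duplicates exist in the real bus stops data would need checking; the rounding precision is an assumption.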

In [24]:
bus_stops_df.shape
Out[24]:
(295, 3)
In [25]:
bus_stops_df.nunique()
Out[25]:
latitude     295
longitude    295
stop_type      1
dtype: int64

Tram Stops Data.¶

I performed the following data cleaning steps:

  • Created 'latitude' and 'longitude' columns from the 'geo_point_2d' column.
  • Dropped the 'geo_shape', 'geo_point_2d', 'xorg', 'stop_no', 'mccid_str', 'xsource', 'xdate', 'mccid_int', and 'name' columns.
  • Added a 'stop_type' column.
In [26]:
tram_stops_df.head(10)
Out[26]:
geo_point_2d geo_shape name xorg stop_no mccid_str xsource xdate mccid_int
0 -37.82023778673241, 144.95786314283018 {"coordinates": [144.95786314283018, -37.82023... Melbourne Aquarium / Flinders Street GIS Team 2 NaN Mapbase 2011-10-18 4
1 -37.82097269970027, 144.95546153614245 {"coordinates": [144.95546153614245, -37.82097... Spencer Street / Flinders Street GIS Team 1 NaN Mapbase 2011-10-18 5
2 -37.82190465062153, 144.95109855638137 {"coordinates": [144.95109855638137, -37.82190... The Goods Shed / Wurundjeri Way GIS Team D5 NaN Mapbase 2011-10-18 7
3 -37.811771476718356, 144.95644059700524 {"coordinates": [144.95644059700524, -37.81177... William Street / La Trobe Street GIS Team 3 NaN Mapbase 2011-10-18 16
4 -37.81105928060848, 144.95891745116262 {"coordinates": [144.95891745116262, -37.81105... Queen Street / La Trobe Street GIS Team 4 NaN Mapbase 2011-10-18 17
5 -37.80961884837298, 144.96384957029932 {"coordinates": [144.96384957029932, -37.80961... Swanston Street / La Trobe Street GIS Team 6 NaN Mapbase 2011-10-18 19
6 -37.808876998255194, 144.96634474519394 {"coordinates": [144.96634474519394, -37.80887... Russell Street / La Trobe Street GIS Team 7 NaN Mapbase 2011-10-18 20
7 -37.81358116790275, 144.97406360491075 {"coordinates": [144.97406360491075, -37.81358... Parliament / Collins Street GIS Team 8 NaN Mapbase 2011-10-18 25
8 -37.8176316450406, 144.96690455927876 {"coordinates": [144.96690455927876, -37.81763... Swanston Street / Flinders Street GIS Team 5 NaN Mapbase 2011-10-18 1
9 -37.818324403770184, 144.964479208357 {"coordinates": [144.964479208357, -37.8183244... Elizabeth Street / Flinders Street GIS Team 4 NaN Mapbase 2011-10-18 2

The tram_stops_df dataset includes information about tram stop locations in Melbourne. It contains columns like geo_point_2d, which represents the latitude and longitude of each stop, and geo_shape, which holds the geospatial shape data in JSON format. Other columns, such as name, provide the tram stop's description, while stop_no gives the stop number. The dataset also includes metadata like the data source (xsource), collection date (xdate), and identifiers in both string and integer formats (mccid_str and mccid_int). This data can be used for analysing tram stop locations and related insights.

In [27]:
tram_stops_df.shape
Out[27]:
(28, 9)

The shape of the tram_stops_df dataset is (28, 9): it has 28 rows and 9 columns.

In [29]:
tram_stops_df.nunique()
Out[29]:
geo_point_2d    28
geo_shape       28
name            28
xorg             1
stop_no         18
mccid_str        0
xsource          1
xdate            1
mccid_int       28
dtype: int64
In [13]:
# Split 'geo_point_2d' into 'latitude' and 'longitude'
tram_stops_df[['latitude', 'longitude']] = tram_stops_df['geo_point_2d'].str.split(', ', expand=True)

# Drop 'geo_shape', 'geo_point_2d', 'xorg', 'stop_no', 'mccid_str', 'xsource', 'xdate', 'mccid_int', and 'name' columns
tram_stops_df = tram_stops_df.drop(columns=['geo_shape', 'geo_point_2d','xorg', 'stop_no', 'mccid_str', 'xsource', 'xdate', 'mccid_int', 'name'])

#Added Stop Type column
tram_stops_df['stop_type'] = 'Tram Stop'

tram_stops_df.head(10)
Out[13]:
latitude longitude stop_type
0 -37.82023778673241 144.95786314283018 Tram Stop
1 -37.82097269970027 144.95546153614245 Tram Stop
2 -37.82190465062153 144.95109855638137 Tram Stop
3 -37.811771476718356 144.95644059700524 Tram Stop
4 -37.81105928060848 144.95891745116262 Tram Stop
5 -37.80961884837298 144.96384957029932 Tram Stop
6 -37.808876998255194 144.96634474519394 Tram Stop
7 -37.81358116790275 144.97406360491075 Tram Stop
8 -37.8176316450406 144.96690455927876 Tram Stop
9 -37.818324403770184 144.964479208357 Tram Stop

The following actions were performed to clean the dataset:

Splitting geo_point_2d into latitude and longitude: this extracts the latitude and longitude values from the geo_point_2d column, which stores a string such as "-37.82023778673241, 144.95786314283018", and assigns them to new latitude and longitude columns.

Dropping unnecessary columns: geo_shape, geo_point_2d, xorg, stop_no, mccid_str, xsource, xdate, mccid_int, and name are dropped as they are no longer needed for the analysis.

Adding a stop_type column: a new column set to 'Tram Stop' for all rows, indicating that these are tram stop locations.
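Because the cleaned bus and tram tables now share the same three-column schema, they can be stacked into a single transport-stops layer for mapping. A minimal sketch with toy one-row stand-ins:

```python
import pandas as pd

# Toy stand-ins for the cleaned bus_stops_df and tram_stops_df (one row each)
bus = pd.DataFrame({
    "latitude": ["-37.80384165792465"],
    "longitude": ["144.93239283833262"],
    "stop_type": ["Bus Stop"],
})
tram = pd.DataFrame({
    "latitude": ["-37.82023778673241"],
    "longitude": ["144.95786314283018"],
    "stop_type": ["Tram Stop"],
})

# The shared schema lets both layers stack into one transport-stops table
stops = pd.concat([bus, tram], ignore_index=True)
print(stops["stop_type"].tolist())  # ['Bus Stop', 'Tram Stop']
```

The stop_type column then distinguishes the two modes in any combined plot or spatial join.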

Climate Data.¶

I performed the following data cleaning steps:

  • Created 'latitude' and 'longitude' columns from the 'latlong' column.
  • Dropped the 'device_id', 'received_at', and 'latlong' columns.
  • Filled null values.
In [31]:
climate_df.head(10)
Out[31]:
device_id received_at sensorlocation latlong minimumwinddirection averagewinddirection maximumwinddirection minimumwindspeed averagewindspeed gustwindspeed airtemperature relativehumidity atmosphericpressure pm25 pm10 noise
0 ICTMicroclimate-08 2025-02-09T00:54:37+00:00 Swanston St - Tram Stop 13 adjacent Federation... -37.8184515, 144.9678474 0.0 153.0 358.0 0.0 3.9 7.9 23.9 57.300000 1009.7 0.0 0.0 80.500000
1 ICTMicroclimate-11 2025-02-09T01:02:11+00:00 1 Treasury Place -37.812888, 144.9750857 0.0 144.0 356.0 0.0 2.0 7.8 24.5 56.200000 1005.3 0.0 0.0 62.900000
2 ICTMicroclimate-05 2025-02-09T01:03:24+00:00 Enterprize Park - Pole ID: COM1667 -37.8204083, 144.9591192 0.0 45.0 133.0 0.0 1.5 2.7 25.0 60.000000 1009.6 1.0 3.0 68.500000
3 ICTMicroclimate-01 2025-02-09T01:02:43+00:00 Birrarung Marr Park - Pole 1131 -37.8185931, 144.9716404 NaN 150.0 NaN NaN 1.6 NaN 23.1 61.099998 1009.0 0.0 5.0 51.700001
4 ICTMicroclimate-09 2025-02-09T01:17:37+00:00 SkyFarm (Jeff's Shed). Rooftop - Melbourne Con... -37.8223306, 144.9521696 0.0 241.0 359.0 0.0 0.9 4.4 25.6 53.700000 1007.9 0.0 0.0 60.200000
5 ICTMicroclimate-05 2025-02-09T01:18:26+00:00 Enterprize Park - Pole ID: COM1667 -37.8204083, 144.9591192 0.0 357.0 32.0 1.6 1.9 2.2 24.5 58.700000 1009.3 1.0 3.0 68.700000
6 ICTMicroclimate-02 2025-02-09T01:26:51+00:00 101 Collins St L11 Rooftop -37.814604, 144.9702991 0.0 357.0 359.0 0.0 0.5 1.4 26.6 51.800000 1004.7 1.0 3.0 69.200000
7 ICTMicroclimate-07 2025-02-09T01:35:39+00:00 Tram Stop 7C - Melbourne Tennis Centre Precinc... -37.8222341, 144.9829409 0.0 91.0 356.0 0.0 0.9 4.4 26.6 49.200000 1011.3 0.0 0.0 64.500000
8 ICTMicroclimate-01 2025-02-09T01:32:44+00:00 Birrarung Marr Park - Pole 1131 -37.8185931, 144.9716404 NaN 143.0 NaN NaN 1.9 NaN 23.9 59.599998 1008.5 0.0 5.0 53.200001
9 ICTMicroclimate-04 2025-02-09T01:38:22+00:00 Batman Park -37.8221828, 144.9562225 0.0 10.0 356.0 0.0 1.5 6.9 26.5 51.800000 1011.9 0.0 0.0 72.300000

The climate_df dataframe contains data from environmental sensors located in different parts of Melbourne, tracking various climate variables. This dataset will be used to analyze environmental conditions in different areas over time.

In [170]:
climate_df.shape
Out[170]:
(330961, 16)

The climate_df dataframe has a shape of (330961, 16), meaning it contains 330,961 rows and 16 columns: 330,961 individual climate measurements across the 16 features recorded by the sensors.

In [32]:
climate_df.nunique()
Out[32]:
device_id                   12
received_at             331169
sensorlocation              11
latlong                     12
minimumwinddirection       360
averagewinddirection       360
maximumwinddirection       361
minimumwindspeed           409
averagewindspeed           102
gustwindspeed              304
airtemperature             543
relativehumidity          1551
atmosphericpressure       1644
pm25                       528
pm10                       119
noise                     1115
dtype: int64
In [14]:
# Convert 'received_at' to datetime format
climate_df['received_at'] = pd.to_datetime(climate_df['received_at'])

# Create separate columns for date and time
climate_df['date'] = climate_df['received_at'].dt.date
climate_df['time'] = climate_df['received_at'].dt.time

# Fill null values
climate_df.fillna(0, inplace=True)

# Split 'latlong' into 'latitude' and 'longitude'
climate_df[['latitude', 'longitude']] = climate_df['latlong'].str.split(', ', expand=True)

# Drop 'device_id', 'received_at', and 'latlong' columns
climate_df = climate_df.drop(columns=['device_id', 'received_at', 'latlong'])

climate_df['hour'] = pd.to_datetime(climate_df['time'], format='%H:%M:%S').dt.hour

# Printing the results
climate_df.head()
Out[14]:
sensorlocation minimumwinddirection averagewinddirection maximumwinddirection minimumwindspeed averagewindspeed gustwindspeed airtemperature relativehumidity atmosphericpressure pm25 pm10 noise date time latitude longitude hour
0 Swanston St - Tram Stop 13 adjacent Federation... 0.0 153.0 358.0 0.0 3.9 7.9 23.9 57.300000 1009.7 0.0 0.0 80.500000 2025-02-09 00:54:37 -37.8184515 144.9678474 0
1 1 Treasury Place 0.0 144.0 356.0 0.0 2.0 7.8 24.5 56.200000 1005.3 0.0 0.0 62.900000 2025-02-09 01:02:11 -37.812888 144.9750857 1
2 Enterprize Park - Pole ID: COM1667 0.0 45.0 133.0 0.0 1.5 2.7 25.0 60.000000 1009.6 1.0 3.0 68.500000 2025-02-09 01:03:24 -37.8204083 144.9591192 1
3 Birrarung Marr Park - Pole 1131 0.0 150.0 0.0 0.0 1.6 0.0 23.1 61.099998 1009.0 0.0 5.0 51.700001 2025-02-09 01:02:43 -37.8185931 144.9716404 1
4 SkyFarm (Jeff's Shed). Rooftop - Melbourne Con... 0.0 241.0 359.0 0.0 0.9 4.4 25.6 53.700000 1007.9 0.0 0.0 60.200000 2025-02-09 01:17:37 -37.8223306 144.9521696 1

The received_at column was converted to datetime and split into separate date and time columns, missing values were filled with 0, and the latlong column was split into latitude and longitude. The device_id, received_at, and latlong columns were then dropped to simplify the dataframe. The result is a clean dataframe with the information required for analysis.
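Filling every gap with 0 is simple but biases averages for variables like air temperature, where 0 is a real (and wintry) value. One gentler alternative, sketched here on toy data with a per-sensor median fill (an assumption, not the notebook's method):

```python
import pandas as pd
import numpy as np

# Toy stand-in: two sensors with gaps in airtemperature (illustrative values)
climate = pd.DataFrame({
    "sensorlocation": ["A", "A", "A", "B", "B"],
    "airtemperature": [23.0, np.nan, 25.0, 18.0, np.nan],
})

# Filling with 0 drags temperature averages down; a per-sensor median is gentler
climate["airtemperature"] = (
    climate.groupby("sensorlocation")["airtemperature"]
           .transform(lambda s: s.fillna(s.median()))
)
print(climate["airtemperature"].tolist())  # [23.0, 24.0, 25.0, 18.0, 18.0]
```

`groupby(...).transform` keeps the original row order, so the fill can replace `fillna(0)` without reshaping the dataframe.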

Data Visualisation - Pedestrian Counting System Data.¶

Plotting the Time Series of Pedestrian Count

In [15]:
# Plotting Daily Pedestrian Distribution
pedestrian_df_N.plot(
    x='Date',
    y='pedestriancount',
    figsize=(12, 4),
    title='Daily Pedestrian Count',
    legend=False,
    color="#ffb84d"
)

plt.xlabel("Date")
plt.ylabel("Pedestrian Count")
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Daily Pedestrian Count time series]

According to the time series plot above:

  • There seems to be an overall increasing trend, suggesting that pedestrian activity has generally risen over time.

  • Some periodic fluctuations may indicate seasonal effects such as weekends, holidays, or special events.

  • Some days show extreme spikes in pedestrian traffic.

  • These could correspond to events, festivals, or special occasions that caused a surge in foot traffic.
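One simple way to surface such spikes is a global z-score over the daily series, flagging days more than two standard deviations above the mean. A sketch on toy counts (illustrative values, not the real data):

```python
import pandas as pd

# Toy daily counts with one event-driven spike (illustrative values only)
counts = pd.Series(
    [500, 520, 480, 510, 2000, 495, 505],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Flag days more than two standard deviations above the series mean
z = (counts - counts.mean()) / counts.std()
spikes = counts[z > 2]
print(spikes)  # the 2000-count day is flagged
```

On the real series, flagged dates could then be cross-checked against Melbourne's event calendar.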

Plotting Monthly Pedestrian Count

In [16]:
# Making a copy of the data
mom_df = pedestrian_df_N.copy()

# Convert to datetime
mom_df["Date"] = pd.to_datetime(mom_df["Date"], errors='coerce')

# Extract year and month
mom_df["year_month"] = mom_df["Date"].dt.to_period("M")
mom_df["year"] = mom_df["Date"].dt.year

# Remove current month
current_year_month = datetime.datetime.today().strftime("%Y-%m")
mom_df = mom_df[mom_df["year_month"].astype(str) != current_year_month]

# Sum counts per month
monthly_counts = mom_df.groupby(["year_month", "year"])["pedestriancount"].sum().reset_index()
monthly_counts["year_month"] = monthly_counts["year_month"].astype(str)

# Plotting 
plt.figure(figsize=(12, 6))

for year in sorted(monthly_counts["year"].unique()):
    year_data = monthly_counts[monthly_counts["year"] == year]
    plt.plot(year_data["year_month"], year_data["pedestriancount"],
             marker='o', linestyle='-', label=str(year), color="#ffb84d")

plt.xlabel("Month")
plt.ylabel("Total Pedestrian Count")
plt.title("Monthly Pedestrian Count")
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
[Figure: Monthly Pedestrian Count]

According to the plot above:

  • The pedestrian count generally increases year over year. This suggests growing foot traffic in urban areas, possibly due to economic recovery, better infrastructure, or population growth.

  • Each year shows periodic rises and falls in pedestrian counts. Possible factors affecting trends:

    • Weather: Cold months may have lower pedestrian activity.

    • Events & Holidays: Some peaks may correspond to major city events.

    • Work & School Cycles: Summer vacations and holiday periods might show variations.

  • Post-pandemic recovery (2021–2022): 2021 starts with a low count, likely due to lingering COVID-19 restrictions. 2022 shows rapid growth, indicating a return to normal foot traffic levels.

  • Recent trends (2024–2025): slight fluctuations in 2024, but pedestrian counts remain relatively high. Early 2025 shows a peak, possibly indicating an ongoing upward trend.
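Year-over-year growth claims like these can be quantified directly with `pct_change` on the monthly totals. A sketch with illustrative numbers:

```python
import pandas as pd

# Toy monthly totals (illustrative values only)
monthly = pd.Series(
    [100_000, 120_000, 90_000],
    index=pd.PeriodIndex(["2024-01", "2024-02", "2024-03"], freq="M"),
)

# Month-over-month percentage change; the first month has no predecessor
mom = (monthly.pct_change() * 100).round(1)
print(mom.tolist())  # [nan, 20.0, -25.0]
```

Applied to `monthly_counts` above, this would turn the visual "periodic rises and falls" into concrete percentages.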

Plotting Hourly Pedestrian Count

In [17]:
# Aggregating pedestrian count by hour of the day
hourly_counts = pedestrian_df_N.groupby("hourday")["pedestriancount"].sum().reset_index()

# Identifying peak hours and off peak hours
top_hours = hourly_counts.nlargest(6, "pedestriancount")["hourday"]
lowest_hours = hourly_counts.nsmallest(7, "pedestriancount")["hourday"]

# Normalizing the counts for color intensity (higher count = darker)
norm = mcolors.Normalize(vmin=hourly_counts["pedestriancount"].min(), vmax=hourly_counts["pedestriancount"].max())

# Defining colors
yellow_shades = [ "#FFF2CC",   "#FFEB99", "#FFDD66", "#FFCC33", "#FFB800",]

num_shades = len(yellow_shades)
colors = [yellow_shades[int(norm(count) * (num_shades - 1))] for count in hourly_counts["pedestriancount"]]

# Plotting the bar chart
plt.figure(figsize=(8, 6))
bars = plt.bar(hourly_counts["hourday"], hourly_counts["pedestriancount"], color=colors)

plt.xlabel("Hour of the Day")
plt.ylabel("Total Pedestrian Count")
plt.title("Pedestrian Count by Hour of the Day")
plt.xticks(range(0, 24))
plt.grid(axis='y', linestyle='--')

# Creating the legend
legend_labels = [
    Patch(color=yellow_shades[0], label="Low Pedestrian Count"),
    Patch(color=yellow_shades[-1], label="High Pedestrian Count")
]

plt.legend(handles=legend_labels, loc='upper right')

plt.tight_layout()
plt.show()
[Figure: Pedestrian Count by Hour of the Day]

According to the bar chart above:

  • Peak hours (darkest bars) occur mostly in the afternoon to early evening, around 12 PM to 5 PM. This suggests that pedestrian footfall is highest around lunchtime and the evening rush.

  • Off-peak hours (lightest bars) fall in the late-night and early-morning window between 12 AM and 6 AM, when pedestrian activity is at its lowest.

  • Moderate-traffic hours (mid-tone bars) show counts rising through the morning and gradually declining in the evening after the peak.
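The peak window's dominance can be quantified as its share of the day's total count. A sketch with illustrative hourly totals shaped roughly like the chart:

```python
import pandas as pd

# Illustrative hourly totals for a single day, shaped roughly like the chart
counts = pd.Series(
    [10, 5, 3, 2, 2, 5, 30, 80, 120, 150, 180, 220,
     260, 250, 240, 230, 210, 200, 150, 100, 70, 50, 30, 20],
    index=range(24),  # hour of day
)

# Share of the day's foot traffic in the 12:00-17:00 peak window
peak = counts.loc[12:17].sum()  # label-based slicing is end-inclusive
share = peak / counts.sum() * 100
print(round(share, 1))  # roughly half the day's traffic
```

The same two lines applied to `hourly_counts` would give the real peak-window share.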

Plotting Pedestrian Count by Weekday

In [18]:
# Converting 'Date' column to datetime format
pedestrian_df_N['Date'] = pd.to_datetime(pedestrian_df_N['Date'], errors='coerce')
pedestrian_df_N['Weekday'] = pedestrian_df_N['Date'].dt.day_name()

# Group by weekday and sum pedestrian counts
weekday_counts = pedestrian_df_N.groupby('Weekday')['pedestriancount'].sum()
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_counts = weekday_counts.reindex(weekday_order)

# Calculating percentages
weekday_percent = (weekday_counts / weekday_counts.sum() * 100).round(1)

norm = mcolors.Normalize(vmin=weekday_counts.min(), vmax=weekday_counts.max())

# Defining colors
yellow_shades = ["#FFF2CC", "#FFEB99", "#FFDD66", "#FFCC33","#FFB800",]

num_shades = len(yellow_shades)
colors = [yellow_shades[int(norm(count) * (num_shades - 1))] for count in weekday_counts.values]

# Plotting
plt.figure(figsize=(6, 4))
ax = sns.barplot(x=weekday_counts.index, y=weekday_counts.values, palette=colors)

for i, (value, percent) in enumerate(zip(weekday_counts.values, weekday_percent.values)):
    ax.text(i, value + 100, f'{percent}%', ha='center', fontsize=9)

plt.ylabel("Total Pedestrian Count")
plt.xlabel("Weekday")
plt.title("Pedestrian Count by Weekday")
plt.tight_layout()
plt.show()
[Figure: Pedestrian Count by Weekday]

According to the bar chart above, Fridays and Saturdays are the busiest days, while Sundays and Mondays are the quietest.

Plotting Pedestrian Counts by Street

In [19]:
# Group and sort pedestrian counts by street
street_counts = pedestrian_df_N.groupby('Street')['pedestriancount'].sum().sort_values(ascending=False)

# Calculating percentages
total_count = street_counts.sum()
street_percent = (street_counts / total_count * 100).round(1)

# Normalizing
norm = mcolors.Normalize(vmin=street_counts.min(), vmax=street_counts.max())

# Defining colors
yellow_shades = [ "#FFF2CC", "#FFEB99", "#FFDD66", "#FFCC33", "#FFB800",]

num_shades = len(yellow_shades)
colors = [yellow_shades[int(norm(count) * (num_shades - 1))] for count in street_counts.values]

# Plotting the bar chart
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=street_counts.values, y=street_counts.index, palette=colors)

# Add percentage labels
for i, (value, percent) in enumerate(zip(street_counts.values, street_percent.values)):
    ax.text(value + 50, i, f'{percent}%', va='center', fontsize=9)

plt.xlabel("Total Pedestrian Count")
plt.ylabel("Street")
plt.title("Total Pedestrian Count by Street")
plt.tight_layout()
plt.show()
[Figure: Total Pedestrian Count by Street]

Swanston Street and Flinders Street are the busiest streets.

Plotting the busiest streets by weekday.

In [22]:
# Create 'Weekday' column from 'Date'
pedestrian_df_N['Date'] = pd.to_datetime(pedestrian_df_N['Date'])
pedestrian_df_N['Weekday'] = pedestrian_df_N['Date'].dt.day_name()

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday','Sunday',]
weekday_dtype = pd.CategoricalDtype(categories=weekday_order, ordered=True)

pedestrian_df_N['Weekday'] = pedestrian_df_N['Weekday'].astype(weekday_dtype)

# Group by Weekday and Street
weekday_street_counts = pedestrian_df_N.groupby(['Weekday', 'Street']).size().reset_index(name='Count')

num_categories = weekday_street_counts['Weekday'].nunique()
nrows = (num_categories // 3) + 1
ncols = 3

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 5 * nrows))
axes = axes.flatten()

X = weekday_street_counts.groupby('Weekday', sort=False) 
num = 0
for category, group in X:
    df = pd.DataFrame(group)  
    top_5_streets = df.nlargest(5, 'Count')

    x_labels = top_5_streets['Street'].values
    y_values = top_5_streets['Count'].values

    # Plotting
    ax = axes[num]
    bars = ax.bar(x_labels, y_values, color='#ffdd99')
    ax.set_title(f'Top 5 Streets for {category}')
    ax.set_xlabel('Street')
    ax.set_ylabel('Pedestrian Count')
    ax.set_xticks(range(len(x_labels)))
    ax.set_xticklabels(x_labels, rotation=90)
    
    total = y_values.sum()
    for bar, count in zip(bars, y_values):
        height = bar.get_height()
        percentage = f'{(count / total * 100):.0f}%'
        ax.annotate(percentage, xy=(bar.get_x() + bar.get_width() / 2, height), 
                    xytext=(0, 3), textcoords="offset points", ha='center', va='top')
    num += 1

for i in range(num, len(axes)):
    axes[i].axis('off')

plt.tight_layout()
plt.show()
No description has been provided for this image

According to the bar charts above, Swanston Street and Flinders Street are the busiest streets on every day of the week.
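One caveat worth noting: `groupby(...).size()` counts hourly sensor records per street, not pedestrians. A small sketch on hypothetical data (not the notebook's dataset) shows how summing `pedestriancount` instead can change the ranking:

```python
import pandas as pd

# Hypothetical mini-frame: two records for Swanston St,
# one busier record for Flinders St.
df = pd.DataFrame({
    'Weekday': ['Monday', 'Monday', 'Monday'],
    'Street': ['Swanston St', 'Swanston St', 'Flinders St'],
    'pedestriancount': [1200, 800, 2500],
})

# size() counts records: Swanston St ranks first (2 records vs 1)
by_records = df.groupby(['Weekday', 'Street']).size().reset_index(name='Count')

# Summing pedestriancount ranks by total foot traffic: Flinders St first
by_volume = (df.groupby(['Weekday', 'Street'])['pedestriancount']
               .sum().reset_index(name='TotalPedestrians'))
```

Whether record counts or summed counts are the better busyness proxy depends on how evenly the sensors report; the record-count view used above is still a reasonable activity indicator.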

In [30]:
# Convert coordinates to numeric
pedestrian_df_N['latitude'] = pd.to_numeric(pedestrian_df_N['latitude'], errors='coerce')
pedestrian_df_N['longitude'] = pd.to_numeric(pedestrian_df_N['longitude'], errors='coerce')

# Filter for 2024 and drop missing values
recent_df = pedestrian_df_N[pedestrian_df_N['Date'].dt.year == 2024].dropna(subset=['latitude', 'longitude'])

# Prepare data for HeatMap
data = recent_df[['latitude', 'longitude', 'pedestriancount']].values.tolist()

# Set map center
map_center = [recent_df['latitude'].mean(), recent_df['longitude'].mean()]
Heat_map = folium.Map(location=map_center, zoom_start=13)

# Add HeatMap layer
HeatMap(data, radius=15, blur=10, max_zoom=1).add_to(Heat_map)

# Add title
title_html = """
<h3 style="text-align: center; margin: 10px 0;">Pedestrian Density Heatmap</h3>
"""
Heat_map.get_root().html.add_child(folium.Element(title_html))

# Display the map
display(Heat_map)

The above heat map visualises pedestrian density across the City of Melbourne.

Data Visualisation - Tree Canopies Data.¶

In [35]:
# Convert latitude and longitude to numeric
tree_canopies_df['latitude'] = pd.to_numeric(tree_canopies_df['latitude'], errors='coerce')
tree_canopies_df['longitude'] = pd.to_numeric(tree_canopies_df['longitude'], errors='coerce')

# Dropping rows with invalid coordinates
tree_canopies_df = tree_canopies_df.dropna(subset=['latitude', 'longitude'])

# Create folium map
canopy_map = folium.Map(
    location=[tree_canopies_df['latitude'].mean(), tree_canopies_df['longitude'].mean()],
    zoom_start=14
)

heat_data = [
    [row['latitude'], row['longitude'], row['shape_area']]
    for index, row in tree_canopies_df.iterrows()
]

HeatMap(
    heat_data,
    min_opacity=0.4,
    radius=15,
    blur=20,
    max_zoom=1,
    gradient={0.2: 'lightgreen', 0.5: 'green', 0.8: 'darkgreen'}
).add_to(canopy_map)

# Add title
title_html = """
<h3 style="text-align: center; margin: 10px 0;">Tree Canopy Density Heatmap</h3>
"""
canopy_map.get_root().html.add_child(folium.Element(title_html))

# Displaying map (renamed from `map` to avoid shadowing the built-in)
display(canopy_map)

The above heat map displays the tree canopy density of the City of Melbourne.

Data Visualisation - Bus Stops and Tram Stops Data.¶

Creating a single data frame to display bus stops and tram stops.

In [37]:
stop_df = pd.concat([tram_stops_df, bus_stops_df], ignore_index=True)
stop_df.head(10)
Out[37]:
latitude longitude stop_type
0 -37.82023778673241 144.95786314283018 Tram Stop
1 -37.82097269970027 144.95546153614245 Tram Stop
2 -37.82190465062153 144.95109855638137 Tram Stop
3 -37.811771476718356 144.95644059700524 Tram Stop
4 -37.81105928060848 144.95891745116262 Tram Stop
5 -37.80961884837298 144.96384957029932 Tram Stop
6 -37.808876998255194 144.96634474519394 Tram Stop
7 -37.81358116790275 144.97406360491075 Tram Stop
8 -37.8176316450406 144.96690455927876 Tram Stop
9 -37.818324403770184 144.964479208357 Tram Stop
In [39]:
# Clean latitude and longitude columns
stop_df['latitude'] = pd.to_numeric(stop_df['latitude'], errors='coerce')
stop_df['longitude'] = pd.to_numeric(stop_df['longitude'], errors='coerce')
stop_df = stop_df.dropna(subset=['latitude', 'longitude'])

if 'stop_name' not in stop_df.columns:
    stop_df['stop_name'] = stop_df['stop_type'].str.capitalize()

map_center = [stop_df['latitude'].mean(), stop_df['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=13)

bus_layer = folium.FeatureGroup(name='Bus Stops')
tram_layer = folium.FeatureGroup(name='Tram Stops')

for _, row in stop_df.iterrows():
    if row['stop_type'] == 'Bus Stop':
        icon = folium.Icon(color='orange', icon='bus', prefix='fa')
        bus_layer.add_child(
            folium.Marker(
                location=[row['latitude'], row['longitude']],
                icon=icon
            )
        )
    else:
        icon = folium.Icon(color='green', icon='train', prefix='fa')
        tram_layer.add_child(
            folium.Marker(
                location=[row['latitude'], row['longitude']],
                icon=icon
            )
        )

# Add layers to map
m.add_child(bus_layer)
m.add_child(tram_layer)

legend_html = """
<div style="position: fixed; 
            bottom: 30px; left: 30px; 
            width: 150px; height: 100px; 
            background-color: white; 
            border: 2px solid grey; 
            z-index: 9999; font-size: 14px;
            padding: 10px;">
<b>Legend</b><br>
<i style="background-color: orange; width: 15px; height: 15px; display: inline-block;"></i> Bus Stop<br>
<i style="background-color: green; width: 15px; height: 15px; display: inline-block;"></i> Tram Stop
</div>
"""
m.get_root().html.add_child(folium.Element(legend_html))

title_html = """
<h3 style="text-align: center; margin: 10px 0;">Public Transport Stops</h3>
"""
m.get_root().html.add_child(folium.Element(title_html))

# Displaying the map
display(m)

The above map displays the public transport stops within the City of Melbourne area.

Data Visualisation - Climate Data.¶

Plotting hourly distribution of climate variables

In [40]:
# Group the data by hour and calculate mean
hourly_avg = climate_df.groupby('hour')[['airtemperature', 'pm25', 'pm10', 'noise']].mean()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('Hourly Averages of Climate Variables', fontsize=16)

axes = axes.flatten()

# Plotting 
for i, column in enumerate(hourly_avg.columns):
    hourly_avg[column].plot(
        kind='bar',
        ax=axes[i],
        title=column.capitalize(),
        color='#ffdd99'
    )
    axes[i].set_xlabel('Hour')
    axes[i].set_ylabel(column.capitalize())
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

Temperature is highest between 3 AM and 6 AM, peaking just over 21°C. It decreases steadily throughout the day, reaching its lowest around 5–6 PM. This pattern may reflect overnight heat retention and daytime cooling, possibly due to cloud cover or local microclimate effects.

PM2.5 levels are relatively stable, around 20–22 units, with a slight dip around 4–6 AM followed by a slight increase during the late morning and evening hours.

PM10 levels show more variation: they are lowest between 3 and 6 AM, rise sharply after 6 AM, peak around 9–10 AM, and then stabilise.

Noise levels also show little variation throughout the day.

In [41]:
# Convert date column to datetime
climate_df['date'] = pd.to_datetime(climate_df['date'])

# Create 'weekday' column 
climate_df['weekday'] = climate_df['date'].dt.day_name()
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Computing weekday averages 
weekday_avg = climate_df.groupby('weekday')[['airtemperature', 'pm25', 'pm10', 'noise']].mean().loc[weekday_order]

fig, axes = plt.subplots(2, 2, figsize=(14, 8))
fig.suptitle('Weekday Averages of Climate Variables', fontsize=16)

axes = axes.flatten()

# Plotting each variable
for i, column in enumerate(weekday_avg.columns):
    weekday_avg[column].plot(
        kind='bar',
        ax=axes[i],
        title=column.capitalize(),
        color='#ffdd99'
    )
    axes[i].set_xticklabels(weekday_order, rotation=45)
    axes[i].set_ylabel(column.capitalize())

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

According to the above plots, air temperature remains fairly stable across the week.

PM2.5 and PM10 increase on weekdays, especially mid-to-late week, indicating possible pollution buildup or activity-related emissions.

Noise remains high and stable, suggesting a consistently active urban environment.

Weather Index Calculation

The weather index was calculated to evaluate pedestrian comfort using climate data. Both the pedestrian and climate datasets are first filtered to the most recent year. The climate subset then keeps only the variables needed to assess weather comfort: air temperature, average and gust wind speeds, relative humidity, and environmental factors such as PM2.5, PM10, and noise levels.

In [43]:
# Filter for 2024 data
pedestrian_2024 = pedestrian_df_N[pedestrian_df_N['Date'].dt.year == 2024]
climate_2024 = climate_df[climate_df['date'].dt.year == 2024]

# Prepare subset
climate_subset = climate_2024[['date', 'hour', 'latitude', 'longitude',
                               'airtemperature', 'averagewindspeed', 'gustwindspeed',
                               'relativehumidity', 'pm25', 'pm10', 'noise']]

# Merge data set
merged = pd.merge(pedestrian_2024, climate_subset,
                  left_on=['Date', 'hourday'], right_on=['date', 'hour'],
                  how='inner')

The weather index was calculated using a custom function that scores temperature, wind, and humidity on a scale from 0 to 1. The ideal temperature is taken to be 21.5°C, with deviations reducing the score. Similarly, low wind speeds (the wind score falls linearly to zero at 6 m/s) and moderate humidity levels (close to 50%) are considered optimal. These three components are then weighted, 40% for temperature and 30% each for wind and humidity, to produce a final weather index value, which is applied to every row of the merged dataset.

To make the index more interpretable, the numeric values are categorized into four qualitative levels: Stressful (0.0–0.3), Tolerable (0.3–0.6), Comfortable (0.6–0.8), and Ideal (0.8–1.0).

In [44]:
# Define the index function
def compute_weather_index(temp, wind, humidity):
    temp_score = max(0, 1 - abs(temp - 21.5) / 10)        # Peak comfort at 21.5°C
    wind_score = max(0, 1 - wind / 6)                     # Falls linearly; reaches 0 at 6 m/s
    humidity_score = max(0, 1 - abs(humidity - 50) / 50)  # Ideal ~50%
    return round((0.4 * temp_score + 0.3 * wind_score + 0.3 * humidity_score), 2)

# Apply to dataset
merged['weather_index'] = merged.apply(
    lambda row: compute_weather_index(row['airtemperature'],
                                      row['averagewindspeed'],
                                      row['relativehumidity']), axis=1)

merged['weather_category'] = pd.cut(merged['weather_index'], 
                                    bins=[0, 0.3, 0.6, 0.8, 1.0], 
                                    labels=['Stressful', 'Tolerable', 'Comfortable', 'Ideal'])
merged.head()
Out[44]:
Date hourday pedestriancount sensor_name latitude_x longitude_x Street Weekday date hour ... longitude_y airtemperature averagewindspeed gustwindspeed relativehumidity pm25 pm10 noise weather_index weather_category
0 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 144.9519007 7.5 0.0 0.1 76.9 0.0 0.0 0.0 0.44 Tolerable
1 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 144.9521696 9.6 0.6 2.4 59.5 3.0 3.0 58.9 0.51 Tolerable
2 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 144.96728 10.6 1.4 2.4 58.1 13.0 15.0 71.2 0.48 Tolerable
3 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 144.9702991 10.6 2.0 2.3 60.3 2.0 4.0 69.8 0.44 Tolerable
4 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 144.9521696 9.5 0.3 1.7 59.7 3.0 3.0 57.2 0.53 Tolerable

5 rows × 21 columns
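The row-wise `apply` used above can be slow on large merged frames. A vectorized equivalent, sketched here assuming the same column names as the merged frame, produces the same scores:

```python
import numpy as np
import pandas as pd

def weather_index_vectorized(df):
    # Mirrors compute_weather_index, but operates on whole columns at once
    temp_score = np.clip(1 - (df['airtemperature'] - 21.5).abs() / 10, 0, None)
    wind_score = np.clip(1 - df['averagewindspeed'] / 6, 0, None)
    humidity_score = np.clip(1 - (df['relativehumidity'] - 50).abs() / 50, 0, None)
    return (0.4 * temp_score + 0.3 * wind_score + 0.3 * humidity_score).round(2)

# First row of the merged output above: 7.5 °C, 0.0 m/s wind, 76.9% humidity
demo = pd.DataFrame({'airtemperature': [7.5],
                     'averagewindspeed': [0.0],
                     'relativehumidity': [76.9]})
print(weather_index_vectorized(demo))  # 0.44, matching the row-wise result
```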

Duplicates were then removed from the merged dataset.

In [45]:
# Remove rows where 'Street' is NaN
ped_df_clean = merged.dropna(subset=['Street'])

# Remove duplicates based on 'date', 'hour', 'latitude_x', and 'longitude_x'
ped_df_N = ped_df_clean.drop_duplicates(subset=['date', 'hour', 'latitude_x', 'longitude_x'])

ped_df_N = ped_df_N.reset_index(drop=True)

ped_df_N.head()
Out[45]:
Date hourday pedestriancount sensor_name latitude_x longitude_x Street Weekday date hour ... longitude_y airtemperature averagewindspeed gustwindspeed relativehumidity pm25 pm10 noise weather_index weather_category
0 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 144.9519007 7.5 0.0 0.1 76.9 0.0 0.0 0.0 0.44 Tolerable
1 2024-09-17 17 513 HarEsP_T -37.814414 144.944330 Harbour Esplanade Tuesday 2024-09-17 17 ... 144.9519007 7.5 0.0 0.1 76.9 0.0 0.0 0.0 0.44 Tolerable
2 2024-09-17 17 111 574Qub_T -37.803100 144.949081 Queensberry St Tuesday 2024-09-17 17 ... 144.9519007 7.5 0.0 0.1 76.9 0.0 0.0 0.0 0.44 Tolerable
3 2024-09-17 17 754 Lon189_T -37.811219 144.966568 Lonsdale St Tuesday 2024-09-17 17 ... 144.9519007 7.5 0.0 0.1 76.9 0.0 0.0 0.0 0.44 Tolerable
4 2024-09-17 17 312 Lyg161_T -37.801697 144.966589 Lygon St Tuesday 2024-09-17 17 ... 144.9519007 7.5 0.0 0.1 76.9 0.0 0.0 0.0 0.44 Tolerable

5 rows × 21 columns

Stress Index Calculation

The Stress Index is based on a canopy coverage ratio that quantifies how much pedestrian activity occurs under the shade of trees. The process begins with cleaning and converting the raw spatial data representing tree canopies, from which a GeoDataFrame (canopies_gdf) is created with a standard geographic coordinate reference system. The pedestrian records are likewise converted into point geometries based on each sensor's latitude and longitude coordinates.

In [57]:
# Clean up geo_shape and convert to geometry
def safe_shape_parser(x):
    try:
        if pd.isnull(x):
            return None
        return shape(json.loads(x))
    except (ValueError, TypeError, json.JSONDecodeError):
        return None

tree_canopies_df['geometry'] = tree_canopies_df['geo_shape'].apply(safe_shape_parser)

# Drop rows with invalid geometries 
tree_canopies_df = tree_canopies_df.dropna(subset=['geometry'])

# Create GeoDataFrame
canopies_gdf = gpd.GeoDataFrame(tree_canopies_df, geometry='geometry', crs="EPSG:4326")

# Create Point geometries from sensor latitude and longitude
ped_df_N['geometry'] = ped_df_N.apply(lambda row: Point(row['longitude_x'], row['latitude_x']), axis=1)

# Convert to GeoDataFrame
pedestrian_gdf = gpd.GeoDataFrame(ped_df_N, geometry='geometry', crs="EPSG:4326")

A spatial join is then performed to check whether each pedestrian point falls within a tree canopy polygon. This spatial relationship is used to create a new binary variable, under_canopy, set to 1 if the pedestrian location is within a canopy and 0 otherwise.

In [58]:
canopies_gdf = canopies_gdf.to_crs(epsg=4326)

pedestrian_with_canopy = gpd.sjoin(pedestrian_gdf, canopies_gdf[['geometry']], how='left', predicate='within')

# Add binary indicator: 1 if under tree canopy, else 0
pedestrian_with_canopy['under_canopy'] = pedestrian_with_canopy['index_right'].notnull().astype(int)

Then duplicate records were removed based on a combination of date, hour, latitude, and longitude.

In [59]:
pedestrian_with_canopy['index_right'] = pedestrian_with_canopy['index_right'].fillna(0).astype(int)

# Remove duplicates based on 'date', 'hour', 'latitude_x', and 'longitude_x'
ped_df = pedestrian_with_canopy.drop_duplicates(subset=['date', 'hour', 'latitude_x', 'longitude_x'])

# reset index after dropping duplicates
ped_df = ped_df.reset_index(drop=True)

The cleaned dataset is then grouped by street to calculate two metrics for each street: the average pedestrian count and the mean canopy coverage ratio, which reflects the proportion of pedestrian observations that occurred under canopy cover. These two metrics are scaled to a 0–1 range using Min-Max normalization, allowing easy comparison across streets regardless of the original scale of the values.

In [64]:
# Group by sensor and calculate mean canopy presence
canopy_stats = ped_df.groupby('Street').agg({
    'pedestriancount': 'mean',
    'under_canopy': 'mean'
}).reset_index()

canopy_stats.rename(columns={
    'pedestriancount': 'avg_ped_count',
    'under_canopy': 'canopy_coverage_ratio'
}, inplace=True)

scaler = MinMaxScaler()
canopy_stats[['ped_scaled', 'canopy_scaled']] = scaler.fit_transform(
    canopy_stats[['avg_ped_count', 'canopy_coverage_ratio']]
)

Finally, the canopy metrics are merged back into the pedestrian dataset (ped_df).

In [65]:
# Merging with the pedestrian data frame
ped_df = ped_df.merge(
    canopy_stats[['Street', 'ped_scaled', 'avg_ped_count', 'canopy_coverage_ratio', 'canopy_scaled']],
    on='Street', how='left'
)

The Stress Index is calculated by multiplying the scaled pedestrian count (ped_scaled) by (1 - canopy_scaled), where canopy_scaled represents the normalised amount of tree cover. The idea is that stress is higher where many people are walking but little shade is available, so high pedestrian activity combined with low canopy presence results in a high stress score.

To make the results interpretable, the computed stress index values are grouped into three categories: Low Stress, Moderate Stress, and High Stress. Thresholds are defined such that scores from 0 to 0.33 are considered low stress, values between 0.33 and 0.66 fall into moderate stress, and anything above 0.66 indicates high stress.

In [66]:
# Calculating Stress Index: high when pedestrian traffic is high and tree cover is low
ped_df['stress_index'] = ped_df['ped_scaled'] * (1 - ped_df['canopy_scaled'])

# Defining thresholds for stress categories
stress_thresholds = {
    'Low Stress': (0, 0.33),          # stress_index below 0.33
    'Moderate Stress': (0.33, 0.66),  # stress_index between 0.33 and 0.66
    'High Stress': (0.66, 1)          # stress_index above 0.66
}

# Function to categorize stress level based on stress_index
def categorize_stress(stress_value):
    if stress_value < 0.33:
        return 'Low Stress'
    elif stress_value < 0.66:
        return 'Moderate Stress'
    else:
        return 'High Stress'

# Apply categorization to the 'stress_index' column
ped_df['stress_category'] = ped_df['stress_index'].apply(categorize_stress)

# Displaying Results
ped_df.head()
Out[66]:
Date hourday pedestriancount sensor_name latitude_x longitude_x Street Weekday date hour ... ped_scaled_y avg_ped_count_y canopy_coverage_ratio_y canopy_scaled_y ped_scaled avg_ped_count canopy_coverage_ratio canopy_scaled stress_index stress_category
0 2024-09-17 17 2167 Eli250_T -37.812585 144.962578 Elizabeth St Tuesday 2024-09-17 17 ... 0.412988 606.984469 0.17599 0.17599 0.412988 606.984469 0.17599 0.17599 0.340306 Moderate Stress
1 2024-09-17 17 513 HarEsP_T -37.814414 144.944330 Harbour Esplanade Tuesday 2024-09-17 17 ... 0.098903 185.514538 0.00000 0.00000 0.098903 185.514538 0.00000 0.00000 0.098903 Low Stress
2 2024-09-17 17 111 574Qub_T -37.803100 144.949081 Queensberry St Tuesday 2024-09-17 17 ... 0.005616 60.333612 0.00000 0.00000 0.005616 60.333612 0.00000 0.00000 0.005616 Low Stress
3 2024-09-17 17 754 Lon189_T -37.811219 144.966568 Lonsdale St Tuesday 2024-09-17 17 ... 0.255010 394.994740 0.00000 0.00000 0.255010 394.994740 0.00000 0.00000 0.255010 Low Stress
4 2024-09-17 17 312 Lyg161_T -37.801697 144.966589 Lygon St Tuesday 2024-09-17 17 ... 0.102953 190.949573 0.00000 0.00000 0.102953 190.949573 0.00000 0.00000 0.102953 Low Stress

5 rows × 38 columns

The above output displays the pedestrian data with a weather categorisation based on the weather index and a stress categorisation based on the stress index.
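The threshold-based stress categorisation could also be expressed with `pd.cut`, mirroring the earlier weather categorisation. This is a sketch, not the notebook's code; note the boundary handling differs slightly (with `pd.cut`, a value of exactly 0.33 falls into the lower bin):

```python
import pandas as pd

# Same three stress bands, expressed with pd.cut on sample values
s = pd.Series([0.10, 0.34, 0.70])
categories = pd.cut(s, bins=[0, 0.33, 0.66, 1.0],
                    labels=['Low Stress', 'Moderate Stress', 'High Stress'],
                    include_lowest=True)
print(list(categories))  # ['Low Stress', 'Moderate Stress', 'High Stress']
```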

Plotting histograms to visualise the distribution and skewness of numerical variables

In [86]:
# Plot histograms with x and y labels and subtitles for numerical variables 
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 10))  
columns = ['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index',
           'avg_ped_count', 'canopy_coverage_ratio', 'stress_index']
axes = axes.ravel()  # Flatten the axes array for easy iteration

for idx, col in enumerate(columns):
    ped_df[col].hist(ax=axes[idx], bins=20, color='#ffdd99')  
    axes[idx].set_title(f'Histogram of {col}', fontsize=12)  
    axes[idx].set_xlabel(f'{col}')  # Set x-label to the column name
    axes[idx].set_ylabel('Frequency')  # Set y-label to 'Frequency'

# Set the overall plot title
fig.suptitle('Histograms for Numerical Variables - Before Transformation', fontsize=16)

# Adjust layout to prevent overlap
plt.tight_layout(rect=[0, 0, 1, 0.96])

# Show the plot
plt.show()
In [85]:
cols=['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index','avg_ped_count', 'canopy_coverage_ratio', 'stress_index']
skewness=ped_df[cols].skew()

# Print the skewness of each variable
print("Skewness of Variables:")
print(skewness)
Skewness of Variables:
pedestriancount          2.753057
latitude_x               0.913180
longitude_x             -0.965174
ped_scaled               1.376044
weather_index            0.236058
avg_ped_count            1.376044
canopy_coverage_ratio    2.401010
stress_index             1.534869
dtype: float64

The pedestriancount, ped_scaled, avg_ped_count, canopy_coverage_ratio, and stress_index columns are noticeably right-skewed (skewness above 1).

Handling skewness by applying Yeo-Johnson Transformation

In [90]:
ped_df_copy = ped_df.copy()

# Features to apply Yeo-Johnson transformation (correcting skewness)
features = ['pedestriancount','ped_scaled', 'avg_ped_count', 'canopy_coverage_ratio', 'stress_index']

# Initialize Yeo-Johnson transformer
pt = PowerTransformer(method='yeo-johnson')

# Apply transformation and replace original columns with transformed columns
ped_df_copy[features] = pt.fit_transform(ped_df_copy[features])

Calculating Skewness of Variables - After Transformation

In [91]:
cols=['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index','avg_ped_count', 'canopy_coverage_ratio', 'stress_index']
skewness=ped_df_copy[cols].skew()

# Print the skewness of each variable
print("Skewness of Variables:")
print(skewness)
Skewness of Variables:
pedestriancount         -0.050484
latitude_x               0.913180
longitude_x             -0.965174
ped_scaled               0.023761
weather_index            0.236058
avg_ped_count           -0.013574
canopy_coverage_ratio    0.494559
stress_index             0.048626
dtype: float64

The skewness of the variables has been corrected by the Yeo-Johnson transformation.

Normalizing the Dataset with Min-Max Scaling

In [94]:
# Features to apply scaling
features = ['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index','avg_ped_count', 'canopy_coverage_ratio', 'stress_index']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Apply MinMax scaling
ped_df_copy[features] = scaler.fit_transform(ped_df_copy[features])
In [95]:
# Plot histograms with x and y labels and subtitles for numerical variables 
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(15, 10))  
columns = ['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index',
           'avg_ped_count', 'canopy_coverage_ratio', 'stress_index']
axes = axes.ravel()  # Flatten the axes array for easy iteration

for idx, col in enumerate(columns):
    ped_df_copy[col].hist(ax=axes[idx], bins=20, color='#ffdd99')  
    axes[idx].set_title(f'Histogram of {col}', fontsize=12)  
    axes[idx].set_xlabel(f'{col}')  # Set x-label to the column name
    axes[idx].set_ylabel('Frequency')  # Set y-label to 'Frequency'

# Set the overall plot title
fig.suptitle('Histograms for Numerical Variables - After Transformation', fontsize=16)

# Adjust layout to prevent overlap
plt.tight_layout(rect=[0, 0, 1, 0.96])

# Show the plot
plt.show()

Walkability Score Calculation

The Walkability Score is calculated to assess how suitable and pleasant a given location and time are for walking. The score is constructed as a weighted combination of several factors that influence pedestrian comfort and safety:

Pedestrian Activity (ped_scaled) and Tree Canopy Coverage (canopy_scaled) are each given a weight of 30%. This reflects the idea that areas with both high foot traffic and good tree coverage are generally more inviting for walking.

The Stress Index is given a weight of 20%, but in inverse form (1 - stress_index), so that lower stress contributes positively to walkability.

The Weather Index, which accounts for temperature, wind, and humidity comfort levels, receives the remaining 20%, emphasizing the importance of weather conditions on the walking experience.

Together, these components create a composite Walkability Score ranging between 0 and 1.

Each score is then classified into one of three categories:

High Walkability for scores greater than or equal to 0.66,

Moderate Walkability for scores between 0.33 and 0.66,

Low Walkability for scores below 0.33.

In [96]:
# Defining the walkability score
ped_df_copy['walkability_score'] = (
    0.3 * ped_df['ped_scaled'] +
    0.3 * ped_df['canopy_scaled'] +
    0.2 * (1 - ped_df['stress_index']) +
    0.2 * ped_df['weather_index']
)
def categorize_walkability(score):
    if score >= 0.66:
        return 'High Walkability'
    elif score >= 0.33:
        return 'Moderate Walkability'
    else:
        return 'Low Walkability'

ped_df_copy['walkability_category'] = ped_df_copy['walkability_score'].apply(categorize_walkability)

# Displaying the result
ped_df_copy.head()
Out[96]:
Date hourday pedestriancount sensor_name latitude_x longitude_x Street Weekday date hour ... canopy_coverage_ratio_y canopy_scaled_y ped_scaled avg_ped_count canopy_coverage_ratio canopy_scaled stress_index stress_category walkability_score walkability_category
0 2024-09-17 17 0.767691 Eli250_T 0.421870 0.730789 Elizabeth St Tuesday 2024-09-17 17 ... 0.17599 0.17599 0.644854 0.625363 0.57623 0.17599 0.606819 Moderate Stress 0.396632 Moderate Walkability
1 2024-09-17 17 0.565273 HarEsP_T 0.363941 0.324779 Harbour Esplanade Tuesday 2024-09-17 17 ... 0.00000 0.00000 0.215811 0.253209 0.00000 0.00000 0.239285 Low Stress 0.297890 Low Walkability
2 2024-09-17 17 0.385963 574Qub_T 0.722158 0.430477 Queensberry St Tuesday 2024-09-17 17 ... 0.00000 0.00000 0.013825 0.021674 0.00000 0.00000 0.015634 Low Stress 0.288562 Low Walkability
3 2024-09-17 17 0.615960 Lon189_T 0.465123 0.819581 Lonsdale St Tuesday 2024-09-17 17 ... 0.00000 0.00000 0.465651 0.471728 0.00000 0.00000 0.502376 Low Stress 0.313501 Low Walkability
4 2024-09-17 17 0.503255 Lyg161_T 0.766581 0.820049 Lygon St Tuesday 2024-09-17 17 ... 0.00000 0.00000 0.223535 0.260504 0.00000 0.00000 0.247653 Low Stress 0.298295 Low Walkability

5 rows × 40 columns

The above output data set makes it easy to interpret the data and identify which areas are most pedestrian-friendly. The dataset can be used for further analysis and model building.

Plotting Histogram for Walkability Score

In [130]:
ped_df_copy['walkability_score'].hist(figsize=(8, 6), bins=20, color='#ffdd99')

plt.suptitle('Histogram for Walkability Score', fontsize=16)
plt.xlabel('Walkability Score')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

According to the above histogram, most walkability scores are concentrated around the moderate level.

Plotting Correlation HeatMap

In [99]:
# Selecting only numeric columns for correlation
numeric_cols_df = ped_df_copy[['walkability_score', 'pedestriancount', 
                               'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index','avg_ped_count', 'canopy_coverage_ratio', 'stress_index' ]]

# Compute correlation matrix
corr_matrix = numeric_cols_df.corr()

# Plot the matrix
plt.figure(figsize=(12,8))

# Create the matrix plot
plt.matshow(corr_matrix, cmap="coolwarm", fignum=1)  
 
plt.title('Correlation Heatmap for Walkability Score', fontsize=16, pad=20)
plt.xticks(range(len(corr_matrix.columns)), corr_matrix.columns, rotation="vertical", ha='right', fontsize=10)
plt.yticks(range(len(corr_matrix.columns)), corr_matrix.columns)

plt.gca().xaxis.set_ticks_position('bottom')

plt.colorbar()

for i in range(len(corr_matrix.columns)):
    for j in range(len(corr_matrix.columns)):
        plt.text(j, i, f"{corr_matrix.iloc[i, j]:.2f}", ha="center", va="center", color="w")

plt.show()

The correlation heatmap for the walkability score reveals key relationships among the contributing variables. The walkability score shows a strong positive correlation with the stress index (0.77), reflecting its dependence on pedestrian density and canopy coverage, where lower stress (from more tree canopy and less crowding) boosts walkability. It also correlates moderately with the weather index (0.41), indicating that favorable weather improves walkability. The stress index is highly correlated with both the scaled pedestrian count and the average pedestrian count (0.97), confirming that areas with high foot traffic and low canopy coverage tend to have elevated stress. Conversely, the canopy coverage ratio has a negative correlation with stress (-0.41), highlighting its role in reducing environmental discomfort. Interestingly, the raw pedestrian count has minimal correlation with walkability (-0.09), suggesting that density alone does not determine pedestrian friendliness; contextual factors like shade and weather are crucial.

Plotting the composition of Walkability Category

In [100]:
# Get counts of each category
pedestrian_counts = ped_df_copy['walkability_category'].value_counts()

# Define labels and sizes for the pie chart
labels = pedestrian_counts.index
sizes = pedestrian_counts.values

# Define specific colors for each category
color_map = {'High Walkability': '#90EE90', 'Moderate Walkability': '#FFF5A5', 'Low Walkability': '#F8B4B4'}

colors = [color_map[label] for label in labels]

# Create the pie chart
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(sizes, colors=colors, labels=labels, autopct='%1.1f%%', startangle=90,
       wedgeprops={"linewidth": 1, "edgecolor": "grey"})  # Add outline here

ax.set(aspect='equal')

plt.title('Composition of Walkability Category')

# Display the pie chart
plt.show()

More than 70% of observations fall within the moderate walkability category.

Plotting Busiest Streets by Walkability Category

In [102]:
# Group by walkability_category and count the occurrences
grouped = ped_df_copy.groupby(['Street', 'walkability_category']).size().unstack(fill_value=0)

grouped['total'] = grouped.sum(axis=1)

sorted_grouped = grouped.sort_values(by='total', ascending=False).drop(columns='total')

# Get top 10 streets
sorted_grouped = sorted_grouped.head(10)

percentages = sorted_grouped.divide(sorted_grouped.sum(axis=1), axis=0) * 100

# Plot the data
fig, ax = plt.subplots(figsize=(10, 8))
sorted_grouped.plot(kind='bar', stacked=True, ax=ax, color=[color_map[cat] for cat in sorted_grouped.columns])

for i, area in enumerate(sorted_grouped.index):
    max_contrib_idx = percentages.loc[area].idxmax()
    max_contrib_val = percentages.loc[area].max()
    
    ax.annotate(f'{max_contrib_val:.0f}%', 
                xy=(i, sorted_grouped.loc[area, :].cumsum()[max_contrib_idx] - sorted_grouped.loc[area, max_contrib_idx]/2), 
                ha='center', va='bottom', fontsize=10, color='black')

plt.title('Top 10 Streets by Pedestrian Walkability Category')
plt.xlabel('Street')
plt.ylabel('Number of Observations')
plt.xticks(rotation=45)
plt.legend(title='Walkability Score Category')
plt.tight_layout()
plt.show()
[Figure: stacked bar chart of the top 10 streets by walkability category]

According to the plot above, the busiest streets mostly fall into the moderate and low walkability categories.

Plotting High Walkability Streets

In [103]:
# Group and count by Street and walkability_category
category_counts = ped_df_copy.groupby(['Street', 'walkability_category']).size().reset_index(name='count')

# Get top 10 streets for each walkability category
top_high = category_counts[category_counts['walkability_category'] == 'High Walkability'].nlargest(10, 'count')

def plot_top_streets(data, title, color):
    plt.figure(figsize=(8, 6))
    sns.barplot(data=data, x='Street', y='count', color=color)
    plt.title(title)
    plt.xlabel('Street')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()

# Plotting
plot_top_streets(top_high, 'High Walkability Streets ', color_map['High Walkability'])
[Figure: bar chart of high walkability streets]

According to the bar plot above, Spring Street and Rebecca Walk are the streets with high walkability scores.

Model Selection and Model Building¶

Implementing Clustering Models

1. DBSCAN Clustering Model

DBSCAN clustering was applied to uncover spatial patterns in pedestrian activity by grouping street segments that share similar characteristics, such as pedestrian counts, weather comfort, canopy coverage, and environmental stress. In the context of evaluating walkability and urban sustainability, clustering helps identify areas with consistent conditions, whether favourable or problematic.

The code calculates and prints the number of clusters and the number of noise points, then evaluates clustering quality using the silhouette score, a metric ranging from -1 to 1 that indicates how well separated and cohesive the clusters are.
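As a quick illustration of how the silhouette score behaves (a toy example, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters in 2D
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Cohesive, well-separated clusters score close to +1;
# overlapping clusters drift toward 0, misassigned points toward -1
score = silhouette_score(points, labels)
print(round(score, 3))
```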

In [106]:
# Build the feature matrix as a numpy array
# (these features are on mixed scales; DBSCAN's eps is scale-sensitive)
coords = np.array(ped_df_copy[['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index',
                               'avg_ped_count', 'canopy_coverage_ratio', 'stress_index']])

# DBSCAN clustering 
dbscan = DBSCAN(eps=0.07, min_samples=5)  
ped_df_copy['dbscan_cluster'] = dbscan.fit_predict(coords)


# Plot clusters
fig, ax = plt.subplots(figsize=(12, 10))
ped_df_copy.plot(column='dbscan_cluster', ax=ax, legend=True, cmap='viridis')

# Add x and y labels
ax.set_xlabel('Longitude_x')
ax.set_ylabel('Latitude_x')

plt.title('DBSCAN Clustering of Street')
plt.show()

# Number of clusters and noise
num_clusters = len(set(ped_df_copy['dbscan_cluster'])) - (1 if -1 in ped_df_copy['dbscan_cluster'].values else 0)
num_noise = (ped_df_copy['dbscan_cluster'] == -1).sum()
print(f"Number of clusters: {num_clusters}")
print(f"Number of noise points: {num_noise}")

# Filter out noise points for silhouette score
filtered_coords = coords[ped_df_copy['dbscan_cluster'] != -1]
filtered_labels = ped_df_copy['dbscan_cluster'][ped_df_copy['dbscan_cluster'] != -1]

if len(set(filtered_labels)) > 1:
    score = silhouette_score(filtered_coords, filtered_labels)
    print(f"Silhouette Score: {score}")
else:
    print("Silhouette Score cannot be computed, insufficient number of clusters.")
[Figure: DBSCAN clustering of streets]
Number of clusters: 60
Number of noise points: 175
Silhouette Score: 0.15884049520269838

The DBSCAN clustering resulted in 60 distinct clusters and 175 noise points, with a silhouette score of 0.1588, indicating relatively weak clustering. These clusters represent areas with similar patterns of pedestrian activity, tree canopy coverage, stress levels, and weather conditions, helping to spatially group urban environments based on walkability-related factors. The presence of 175 noise points suggests that some locations have unique or sparse characteristics that don't fit into the main clusters, potentially highlighting outlier zones that merit further investigation. Although the silhouette score is low, which means the clusters are not strongly separated, it still provides a useful foundation for identifying areas with comparable environmental and pedestrian dynamics, guiding targeted interventions for improving urban walkability and comfort.
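Because eps in DBSCAN is distance-based, the mixed-scale feature matrix above (raw pedestrian counts alongside 0-1 indices) makes eps = 0.07 hard to interpret. A minimal sketch of two common remedies, standardising the features and reading eps off a k-distance curve, using synthetic stand-in data (the column scales, k, and the 300-row shape are illustrative assumptions, not the notebook's values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
# Synthetic stand-in for a mixed-scale feature matrix:
# one count-like column (large scale) and one ratio column (0-1 scale)
coords = np.column_stack([rng.normal(500, 200, 300), rng.uniform(0, 1, 300)])

# Standardize so distances (and hence eps) treat all features comparably
coords_scaled = StandardScaler().fit_transform(coords)

# k-distance curve: sorted distance to each point's k-th nearest neighbour;
# a pronounced "elbow" in this curve is a common heuristic for choosing eps
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(coords_scaled)
distances, _ = nn.kneighbors(coords_scaled)
k_distances = np.sort(distances[:, -1])
print(k_distances[:5], k_distances[-5:])
```

Plotting `k_distances` against point rank would show the elbow visually; the DBSCAN cell could then be rerun with the chosen eps on the scaled matrix.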

2. KMeans Clustering Model

In [108]:
# Build the feature matrix as a numpy array
coords = np.array(ped_df_copy[['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index',
                               'avg_ped_count', 'canopy_coverage_ratio', 'stress_index']])
# Number of clusters
n_clusters = 5

# K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
ped_df_copy['kmeans_cluster'] = kmeans.fit_predict(coords)

fig, ax = plt.subplots(figsize=(10, 10))
ped_df_copy.plot(column='kmeans_cluster', ax=ax, legend=True, cmap='viridis')

# Add x and y labels
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude') 

# Plot clusters
plt.title('K-means Clustering of Street')
plt.show()

# Evaluation metrics

# Silhouette Score
score = silhouette_score(coords, ped_df_copy['kmeans_cluster'])
print(f"Silhouette Score: {score}")

inertia = kmeans.inertia_
print(f"Inertia: {inertia}")

# Number of points in each cluster
cluster_counts = ped_df_copy['kmeans_cluster'].value_counts()
print("Number of points in each cluster:")
print(cluster_counts)
[Figure: K-means clustering of streets]
Silhouette Score: 0.23672547371813915
Inertia: 60213.89854916249
Number of points in each cluster:
kmeans_cluster
2    85689
1    80345
3    78237
0    72653
4    57559
Name: count, dtype: int64

The K-means clustering approach divided the dataset into 5 distinct clusters, each representing areas with similar characteristics such as pedestrian counts, location coordinates, canopy coverage, weather conditions, and stress levels. The silhouette score of 0.2367 indicates a moderate clustering quality, which is better than DBSCAN. The inertia value of 60,213.90 reflects the sum of squared distances between data points and their assigned cluster centers, with lower values generally indicating tighter clusters. The distribution of data points across clusters shows relatively balanced grouping, with Cluster 2 containing the most observations (85,689 points) and Cluster 4 the least (57,559 points). This clustering result helps to identify and differentiate urban zones based on multi-dimensional environmental and mobility features, which can support targeted policy making and urban planning to enhance walkability and reduce environmental stress.
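The choice of n_clusters = 5 above is not derived in the notebook; a common heuristic is the elbow method, which plots inertia against k and looks for where the curve flattens. A hedged sketch on synthetic data (make_blobs stands in for the notebook's feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (stand-in for the real features)
X_demo, _ = make_blobs(n_samples=400, centers=4, random_state=42)

# Inertia for a range of k; the "elbow" where the curve stops
# dropping sharply is a common heuristic for choosing n_clusters
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```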

Implementing Regression Models

1. Linear Regression Model

In [109]:
# Features and target variable
X = ped_df_copy[['pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index',
                                  'avg_ped_count', 'canopy_coverage_ratio', 'stress_index' ]]
y = ped_df_copy['walkability_score']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f'MSE: {mean_squared_error(y_test, y_pred)}')
print(f'R-squared: {r2_score(y_test, y_pred)}')
MSE: 0.0007531016060040055
R-squared: 0.8911717417108225

The linear regression model was trained to predict the walkability score using features such as pedestrian count, geographic coordinates, scaled pedestrian activity, weather index, average pedestrian count, canopy coverage ratio, and stress index. After splitting the data into training and test sets (80/20), the model demonstrated a Mean Squared Error (MSE) of 0.00075, indicating that the predicted walkability scores were very close to the actual values, on average. Additionally, the model achieved a high R-squared value of 0.891, meaning it explains approximately 89.1% of the variance in walkability scores on the test data. This suggests the selected features are highly effective in modeling walkability and can be used reliably for predictive and planning purposes in urban mobility studies.
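Beyond MSE and R², the fitted coefficients (`model.coef_`) show each feature's marginal contribution to the predicted walkability score. A minimal sketch on synthetic data with known coefficients, to illustrate what coefficient inspection recovers (all names and values here are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic example: y is a known linear combination plus small noise
X_demo = rng.normal(size=(1000, 3))
true_coefs = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coefs + 3.0 + rng.normal(scale=0.1, size=1000)

lr = LinearRegression().fit(X_demo, y_demo)
# Recovered coefficients should land close to the true ones
print(np.round(lr.coef_, 2), round(lr.intercept_, 2))
```

On the notebook's data, pairing `model.coef_` with the feature names would show which inputs push the walkability score up or down.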

2. Logistic Regression Model

In [110]:
# Convert walkability score to binary classification
threshold = 0.5  
ped_df_copy['walkability_binary'] = (ped_df_copy['walkability_score'] > threshold).astype(int)
y = ped_df_copy['walkability_binary']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
print(f'Precision: {precision_score(y_test, y_pred)}')
print(f'Recall: {recall_score(y_test, y_pred)}')
print(f'F1-score: {f1_score(y_test, y_pred)}')

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Insufficient Walkability', 'Sufficient Walkability'])

fig, ax = plt.subplots(figsize=(10, 7))
disp.plot(cmap='Blues', ax=ax)  # draw on the created axes (avoids an empty extra figure)
plt.title('Confusion Matrix for Logistic Regression Model')
plt.show()

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(12, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='red', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
Accuracy: 0.9959811474424877
Precision: 0.9880090965474467
Recall: 0.9516129032258065
F1-score: 0.969469520235318
[Figure: confusion matrix for the logistic regression model]
[Figure: ROC curve]

The above logistic regression model was trained and evaluated. Accuracy was 99.6%, showing that nearly all predictions were correct. Precision was 98.8%, indicating that when the model predicted sufficient walkability it was correct most of the time. Recall was 95.2%, meaning it successfully identified most of the areas with sufficient walkability. The F1-score, which balances precision and recall, was 96.9%, confirming strong overall classification performance. The confusion matrix visually confirms the low number of misclassifications, and the ROC curve, with an AUC close to 1.0, demonstrates outstanding discriminatory ability. This classification model is highly reliable and valuable for identifying walkable areas in urban planning and mobility analysis.
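The 0.5 cut-off used to binarise the walkability score (and implicitly by `predict`) is tunable: raising the decision threshold on `predict_proba` trades recall for precision. A sketch on a synthetic problem (make_classification stands in for the walkability features; the thresholds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic binary problem (stand-in for the walkability labels)
X_demo, y_demo = make_classification(n_samples=2000, n_features=8,
                                     random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
probs = clf.predict_proba(X_demo)[:, 1]

# Raising the threshold predicts fewer positives: recall can only
# fall, while precision typically rises
results = {}
for t in (0.3, 0.5, 0.7):
    preds = (probs > t).astype(int)
    results[t] = (precision_score(y_demo, preds), recall_score(y_demo, preds))

for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.3f}, recall={r:.3f}")
```

For the walkability use case, a city planner wanting to avoid missing any walkable area would lower the threshold; one wanting high confidence in flagged areas would raise it.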

3. Random Forest Regressor Model

In [111]:
# Fit model
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f'MSE: {mean_squared_error(y_test, y_pred)}')
print(f'R-squared: {r2_score(y_test, y_pred)}')

# Feature importance
importances = model.feature_importances_

# Create a DataFrame for feature importances
feature_importances_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importances_df['Feature'], feature_importances_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Feature Importance in Random Forest Regressor')
plt.gca().invert_yaxis()
plt.show()
MSE: 1.9493437654378685e-07
R-squared: 0.9999968838471727
[Figure: feature importance in the Random Forest Regressor]

The Random Forest Regressor achieved a very low Mean Squared Error (MSE) of about 1.9e-07 and an R-squared value of 0.999997, an almost perfect fit on the test set. A score this close to 1 warrants some caution: it may indicate that the walkability score is largely a deterministic function of the input features, rather than evidence of generalisation to genuinely unseen conditions. The feature importance plot reveals how each input variable contributed to the model's predictions; the most important feature is the canopy coverage ratio.
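Impurity-based importances from `feature_importances_` can overstate correlated or high-cardinality features; permutation importance is a model-agnostic cross-check that measures the drop in score when each feature is shuffled. A hedged sketch on synthetic data (the three-feature setup is illustrative, not the notebook's feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
# Synthetic target driven mostly by the first feature
X_demo = rng.normal(size=(500, 3))
y_demo = (5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=500))

rf = RandomForestRegressor(random_state=42).fit(X_demo, y_demo)

# Permutation importance: mean drop in R² when each column is shuffled
result = permutation_importance(rf, X_demo, y_demo, n_repeats=5,
                                random_state=42)
print(np.round(result.importances_mean, 3))
```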

  • The Linear Regression model predicts the walkability score with good accuracy (R² = 0.89, MSE = 0.00075), but is limited in capturing complex patterns.

  • The Logistic Regression model classifies walkability with high accuracy (Accuracy = 99.6%, F1 = 96.9%), making it suitable for threshold-based decisions.

  • The Random Forest Regressor provides the strongest predictions (R² = 0.999997, MSE ≈ 0) and captures complex nonlinear relationships well.

Implementing a deep learning approach using an FFNN to classify walkability

A Feedforward Neural Network (FFNN) was implemented with multiple dense layers to model complex nonlinear relationships between pedestrian activity, environmental features, and walkability. After scaling the input features and applying early stopping to prevent overfitting, the FFNN was trained to classify walkability (binary) using a sigmoid activation function. The model effectively learned from spatial, behavioral, and environmental data to predict areas of sufficient or insufficient walkability.

In [114]:
# Custom similarity metrics (defined for reference; not passed to model.compile below)

# Pearson Correlation
def pearson_correlation(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    
    x_mean = K.mean(y_true)
    y_mean = K.mean(y_pred)
    x_var = K.mean(K.square(y_true - x_mean))
    y_var = K.mean(K.square(y_pred - y_mean))
    covariance = K.mean((y_true - x_mean) * (y_pred - y_mean))
    return covariance / (K.sqrt(x_var) * K.sqrt(y_var))

# Euclidean Distance
def euclidean_distance(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    return K.sqrt(K.sum(K.square(y_true - y_pred)))


ped_df_copy['walkability_binary'] = (ped_df_copy['walkability_score'] > threshold).astype(int)
y = ped_df_copy['walkability_binary']

# Features for the model
# Note: this list includes 'walkability_score', the variable the binary
# label is derived from, so the target leaks into the inputs and the
# near-perfect accuracy below should be interpreted accordingly
features = ['walkability_score', 'pedestriancount', 'latitude_x', 'longitude_x', 'ped_scaled', 'weather_index',
            'avg_ped_count', 'canopy_coverage_ratio', 'stress_index']

# Feature matrix
X = ped_df_copy[features]

# Scaling the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Build the FFNN model
model = Sequential()
model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model (binary cross-entropy loss, accuracy metric)
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()

# Early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Fit the model
history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2, verbose=1, callbacks=[early_stopping])

# Predict
y_pred_prob = model.predict(X_test)
y_pred = (y_pred_prob > 0.5).astype(int)

C:\Users\chath\AppData\Roaming\Python\Python311\site-packages\keras\src\layers\core\dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                         ┃ Output Shape                ┃         Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dense (Dense)                        │ (None, 64)                  │             640 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense)                      │ (None, 32)                  │           2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense)                      │ (None, 1)                   │              33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
 Total params: 2,753 (10.75 KB)
 Trainable params: 2,753 (10.75 KB)
 Non-trainable params: 0 (0.00 B)
Epoch 1/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 18s 2ms/step - accuracy: 0.9900 - loss: 0.0341 - val_accuracy: 0.9969 - val_loss: 0.0072
Epoch 2/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 12s 2ms/step - accuracy: 0.9985 - loss: 0.0037 - val_accuracy: 0.9988 - val_loss: 0.0031
Epoch 3/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 11s 1ms/step - accuracy: 0.9988 - loss: 0.0032 - val_accuracy: 0.9986 - val_loss: 0.0033
Epoch 4/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9989 - loss: 0.0027 - val_accuracy: 0.9991 - val_loss: 0.0021
Epoch 5/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 17s 2ms/step - accuracy: 0.9990 - loss: 0.0024 - val_accuracy: 0.9994 - val_loss: 0.0016
Epoch 6/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9991 - loss: 0.0024 - val_accuracy: 0.9996 - val_loss: 0.0013
Epoch 7/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 12s 2ms/step - accuracy: 0.9990 - loss: 0.0025 - val_accuracy: 0.9991 - val_loss: 0.0018
Epoch 8/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 11s 1ms/step - accuracy: 0.9992 - loss: 0.0020 - val_accuracy: 0.9992 - val_loss: 0.0017
Epoch 9/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9992 - loss: 0.0020 - val_accuracy: 0.9994 - val_loss: 0.0012
Epoch 10/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - accuracy: 0.9992 - loss: 0.0019 - val_accuracy: 0.9996 - val_loss: 0.0013
Epoch 11/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 13s 2ms/step - accuracy: 0.9992 - loss: 0.0018 - val_accuracy: 0.9993 - val_loss: 0.0015
Epoch 12/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 12s 2ms/step - accuracy: 0.9992 - loss: 0.0019 - val_accuracy: 0.9988 - val_loss: 0.0021
Epoch 13/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - accuracy: 0.9993 - loss: 0.0017 - val_accuracy: 0.9995 - val_loss: 0.0011
Epoch 14/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - accuracy: 0.9993 - loss: 0.0016 - val_accuracy: 0.9984 - val_loss: 0.0052
Epoch 15/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9992 - loss: 0.0018 - val_accuracy: 0.9994 - val_loss: 0.0011
Epoch 16/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 17s 2ms/step - accuracy: 0.9994 - loss: 0.0014 - val_accuracy: 0.9997 - val_loss: 7.0937e-04
Epoch 17/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 17s 2ms/step - accuracy: 0.9994 - loss: 0.0015 - val_accuracy: 0.9986 - val_loss: 0.0032
Epoch 18/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 11s 2ms/step - accuracy: 0.9994 - loss: 0.0014 - val_accuracy: 0.9993 - val_loss: 0.0024
Epoch 19/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9993 - loss: 0.0016 - val_accuracy: 0.9986 - val_loss: 0.0039
Epoch 20/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9994 - loss: 0.0014 - val_accuracy: 0.9998 - val_loss: 6.9379e-04
Epoch 21/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9995 - loss: 0.0014 - val_accuracy: 0.9991 - val_loss: 0.0013
Epoch 22/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - accuracy: 0.9994 - loss: 0.0013 - val_accuracy: 0.9994 - val_loss: 0.0013
Epoch 23/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9995 - loss: 0.0012 - val_accuracy: 0.9995 - val_loss: 9.8364e-04
Epoch 24/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9995 - loss: 0.0012 - val_accuracy: 0.9997 - val_loss: 6.0455e-04
Epoch 25/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9996 - loss: 0.0011 - val_accuracy: 0.9993 - val_loss: 0.0017
Epoch 26/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9996 - loss: 0.0012 - val_accuracy: 0.9998 - val_loss: 7.4344e-04
Epoch 27/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 15s 2ms/step - accuracy: 0.9996 - loss: 0.0011 - val_accuracy: 0.9997 - val_loss: 8.0794e-04
Epoch 28/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.9995 - loss: 0.0011 - val_accuracy: 0.9997 - val_loss: 7.5956e-04
Epoch 29/100
7490/7490 ━━━━━━━━━━━━━━━━━━━━ 13s 2ms/step - accuracy: 0.9996 - loss: 0.0011 - val_accuracy: 0.9995 - val_loss: 9.3675e-04
2341/2341 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step

Evaluating the FFNN

In [116]:
# Plot training and validation metrics
plt.figure(figsize=(14, 6))

# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.xlabel('Epochs')
plt.ylabel('Binary Crossentropy Loss')
plt.title('Loss Over Epochs')
plt.legend()
# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Accuracy Over Epochs')
plt.legend()

plt.show()

# Evaluate the model
score = model.evaluate(X_test, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
[Figure: training/validation loss and accuracy over epochs]
2341/2341 ━━━━━━━━━━━━━━━━━━━━ 3s 1ms/step - accuracy: 0.9997 - loss: 6.5286e-04
Test loss: 0.0006515420973300934
Test accuracy: 0.9997329711914062

The plotted training curves show that the Feedforward Neural Network (FFNN) performed strongly. Both training and validation loss decreased steadily, indicating the model learned efficiently without overfitting. Accuracy remained consistently high, with final test accuracy reaching 99.97% and binary cross-entropy loss as low as 0.00065, confirming the model's ability to reproduce the walkability labels. Note, however, that the feature set includes the walkability score itself, from which the binary label is derived, so near-perfect accuracy is expected here; evaluating on the remaining features alone would give a more informative measure of the FFNN's ability to capture complex patterns in urban mobility and pedestrian data.
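Accuracy alone can be flattering on imbalanced labels; the same precision/recall/F1 metrics used for the logistic model apply equally to the FFNN's thresholded predictions. A minimal sketch with stand-in arrays (in the notebook, y_test and the thresholded y_pred from the training cell would be passed instead of these invented values):

```python
import numpy as np
from sklearn.metrics import classification_report

# Stand-in arrays; in the notebook these would be y_test and y_pred
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Per-class precision, recall, F1 and support in one table
print(classification_report(y_true_demo, y_pred_demo,
                            target_names=['Insufficient', 'Sufficient']))
```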

Outputs

1. A single data set for each area or street with pedestrian count, weather index, weather category, canopy coverage, stress index, stress category, walkability score, and walkability category

In [126]:
ped_df_final = ped_df_copy[['Street', 'latitude_x', 'longitude_x', 'weather_index', 'canopy_coverage_ratio','stress_index','walkability_score' , 'walkability_category']]
ped_df_final.head()
Out[126]:
Street latitude_x longitude_x weather_index canopy_coverage_ratio stress_index walkability_score walkability_category
0 Elizabeth St 0.421870 0.730789 0.442105 0.57623 0.606819 0.396632 Moderate Walkability
1 Harbour Esplanade 0.363941 0.324779 0.442105 0.00000 0.239285 0.297890 Low Walkability
2 Queensberry St 0.722158 0.430477 0.442105 0.00000 0.015634 0.288562 Low Walkability
3 Lonsdale St 0.465123 0.819581 0.442105 0.00000 0.502376 0.313501 Low Walkability
4 Lygon St 0.766581 0.820049 0.442105 0.00000 0.247653 0.298295 Low Walkability

2. Top areas for each walkability category

In [128]:
# Get unique categories
categories = ped_df_final['walkability_category'].unique()

# Collect the first 10 rows per category
# (.head(10) takes the first rows encountered, not the highest-scoring ones)
top10_tables = {}

for cat in categories:
    top10_tables[cat] = ped_df_final[ped_df_final['walkability_category'] == cat].head(10)

# Display each table separately
for cat, table in top10_tables.items():
    print(f"Top 10 rows for category: {cat}")
    display(table)  # if in Jupyter notebook; use print(table) otherwise
    print("\n")
Top 10 rows for category: Moderate Walkability
Street latitude_x longitude_x weather_index canopy_coverage_ratio stress_index walkability_score walkability_category
0 Elizabeth St 0.421870 0.730789 0.442105 0.576230 0.606819 0.396632 Moderate Walkability
6 Flinders St 0.340564 1.000000 0.442105 0.585423 0.480664 0.382005 Moderate Walkability
7 Swanston St 0.323423 0.824884 0.442105 0.000000 0.700758 0.331538 Moderate Walkability
8 Convention Centre Pl 0.059897 0.585421 0.442105 0.000000 0.736216 0.335733 Moderate Walkability
9 King St 0.418966 0.536919 0.442105 0.756715 0.202690 0.396620 Moderate Walkability
10 Flinders St 0.251037 0.785443 0.442105 0.585423 0.480664 0.382005 Moderate Walkability
13 La Trobe St 0.503909 0.946582 0.442105 0.433460 0.286308 0.339183 Moderate Walkability
14 Bourke St 0.383472 0.894996 0.442105 0.641712 0.249957 0.370674 Moderate Walkability
15 Batman Ave 0.237309 0.975581 0.442105 0.925300 0.044197 0.466964 Moderate Walkability
16 Flinders St 0.303361 0.916082 0.442105 0.585423 0.480664 0.382005 Moderate Walkability

Top 10 rows for category: Low Walkability
Street latitude_x longitude_x weather_index canopy_coverage_ratio stress_index walkability_score walkability_category
1 Harbour Esplanade 0.363941 0.324779 0.442105 0.0 0.239285 0.297890 Low Walkability
2 Queensberry St 0.722158 0.430477 0.442105 0.0 0.015634 0.288562 Low Walkability
3 Lonsdale St 0.465123 0.819581 0.442105 0.0 0.502376 0.313501 Low Walkability
4 Lygon St 0.766581 0.820049 0.442105 0.0 0.247653 0.298295 Low Walkability
5 Lonsdale St 0.335448 0.501613 0.442105 0.0 0.502376 0.313501 Low Walkability
11 toria St 0.628160 0.594380 0.442105 0.0 0.291740 0.300511 Low Walkability
12 Lygon St 0.881029 0.833867 0.442105 0.0 0.247653 0.298295 Low Walkability
18 Little Bourke St 0.448956 0.856929 0.442105 0.0 0.448458 0.309724 Low Walkability
23 Docklands 0.324824 0.221908 0.442105 0.0 0.188021 0.295508 Low Walkability
25 Docklands 0.358701 0.293488 0.442105 0.0 0.188021 0.295508 Low Walkability

Top 10 rows for category: High Walkability
Street latitude_x longitude_x weather_index canopy_coverage_ratio stress_index walkability_score walkability_category
198 Spring Street 0.388919 0.967944 0.600000 1.0 0.0 0.684441 High Walkability
205 Rebecca Walk 0.132391 0.574872 0.600000 1.0 0.0 0.672399 High Walkability
229 Rebecca Walk 0.132391 0.574872 0.631579 1.0 0.0 0.678399 High Walkability
239 Spring Street 0.388919 0.967944 0.631579 1.0 0.0 0.690441 High Walkability
307 Rebecca Walk 0.132391 0.574872 0.652632 1.0 0.0 0.682399 High Walkability
352 Spring Street 0.388919 0.967944 0.652632 1.0 0.0 0.694441 High Walkability
540 Rebecca Walk 0.132391 0.574872 0.684211 1.0 0.0 0.688399 High Walkability
598 Spring Street 0.388919 0.967944 0.684211 1.0 0.0 0.700441 High Walkability
849 Spring Street 0.388919 0.967944 0.736842 1.0 0.0 0.710441 High Walkability
863 Rebecca Walk 0.132391 0.574872 0.736842 1.0 0.0 0.698399 High Walkability
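Since `.head(10)` returns the first rows encountered rather than the highest-scoring ones, genuinely "top" streets per category could be selected by sorting on walkability_score first. A sketch on a small stand-in frame (the notebook would use ped_df_final; the streets and scores below are invented for illustration):

```python
import pandas as pd

# Small stand-in for ped_df_final
demo = pd.DataFrame({
    'Street': ['A St', 'B St', 'C St', 'D St', 'E St'],
    'walkability_score': [0.70, 0.31, 0.68, 0.29, 0.40],
    'walkability_category': ['High Walkability', 'Low Walkability',
                             'High Walkability', 'Low Walkability',
                             'Moderate Walkability'],
})

# Highest-scoring rows within each category (here the top 2 per category):
# sort once globally, then take the leading rows of each group
top_per_cat = (demo.sort_values('walkability_score', ascending=False)
                   .groupby('walkability_category', sort=False)
                   .head(2))
print(top_per_cat)
```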

Findings¶

The use case demonstrates mobility analysis for the City of Melbourne area. According to the study, walkability in the city is strongly influenced by a combination of pedestrian traffic, tree canopy coverage, and weather conditions. High walkability zones are characterised by higher pedestrian activity, ample canopy coverage, and lower stress indices, creating comfortable and accessible environments. Moderate walkability areas show more variability in these factors, while low walkability regions suffer from sparse pedestrian presence, limited canopy, and higher stress, reducing pedestrian comfort and accessibility. The clustering and predictive models further validate these patterns, offering valuable insights for urban planners to enhance pedestrian-friendly infrastructure and prioritise green cover to improve walkability and overall urban livability.

Further, this analysis can be used in urban planning and policy making to design safer, more comfortable pedestrian zones by optimising tree planting, mitigating environmental stressors, and targeting resources at infrastructure improvements in low walkability areas. Furthermore, real-time monitoring and forecasting of pedestrian comfort using weather and stress indices could guide city services, support integration with public health initiatives promoting active transportation, and feed smart city projects by incorporating walkability metrics into mobility apps, navigation systems, and community engagement platforms for more sustainable urban living.

Future analysis could add factors such as sidewalk quality, safety, and public transport access. Advanced spatio-temporal and deep learning models could better capture pedestrian behaviour and its variations. Model interpretability could be improved with explainable AI tools, and enabling scenario simulations would help urban planners prioritise interventions effectively. Developing interactive visualisation platforms and extending the framework to consider accessibility for different groups would increase its usability and impact in creating more pedestrian-friendly urban environments.

References¶

City of Melbourne, Economic Development Strategy 2031. https://www.melbourne.vic.gov.au/economic-development-strategy-2031

Lesani, A., Nateghinia, E., & Miranda-Moreno, L. F. Development and evaluation of a real-time pedestrian counting system for high-volume conditions based on 2D LiDAR. https://www.sciencedirect.com/science/article/abs/pii/S0968090X1930083X
